The 46th Annual IEEE/ACM International Symposium on Microarchitecture, 2013

MICRO-46 Session 3A - Main Memory

RowClone: Fast and Energy-Efficient In-DRAM Bulk Data Copy and Initialization

Vivek Seshadri (CMU)
Yoongu Kim (CMU)
Chris Fallin (CMU)
Donghyuk Lee (CMU)
Rachata Ausavarungnirun (CMU)
Gennady Pekhimenko (CMU)
Yixin Luo (CMU)
Onur Mutlu (CMU)
Phillip B. Gibbons (Intel)
Michael A. Kozuch (Intel)
Todd C. Mowry (CMU)

Lightning session talk: PDF, Presentation: PDF, Poster: PDF, Full Paper: DOI 10.1145/2540708.2540725

Abstract:
Several system-level operations trigger bulk data copy or initialization. Even though these bulk data operations do not require any computation, current systems transfer a large quantity of data back and forth on the memory channel to perform such operations. As a result, bulk data operations consume high latency, bandwidth, and energy�degrading both system performance and energy efficiency.

In this work, we propose RowClone, a new and simple mechanism to perform bulk copy and initialization completely within DRAM � eliminating the need to transfer any data over the memory channel to perform such operations. Our key observation is that DRAM can internally and efficiently transfer a large quantity of data (multiple KBs) between a row of DRAM cells and the associated row buffer. Based on this, our primary mechanism can quickly copy an entire row of data from a source row to a destination row by first copying the data from the source row to the row buffer and then from the row buffer to the destination row, via two back-to-back activate commands. This mechanism, which we call the Fast Parallel Mode of RowClone, reduces the latency and energy consumption of a 4KB bulk copy operation by 11.6× and 74.4×, respectively, and a 4KB bulk zeroing operation by 6.0× and 41.5×, respectively. To efficiently copy data between rows that do not share a row buffer, we propose a second mode of RowClone, the Pipelined Serial Mode, which uses the shared internal bus of a DRAM chip to quickly copy data between two banks. RowClone requires only a 0.01% increase in DRAM chip area.

We quantitatively evaluate the benefits of RowClone by focusing on fork, one of the frequently invoked system calls, and five other copy and initialization intensive applications. Our results show that RowClone can significantly improve both single-core and multi-core system performance, while also significantly reducing main memory bandwidth and energy consumption.