Abstract:
Many future heterogeneous systems will integrate CPUs and GPUs physically on a single chip and logically connect them via shared memory to avoid explicit data copying. Making this shared memory coherent facilitates programming and fine-grained sharing, but throughput-oriented GPUs can overwhelm CPUs with coherence requests not well-filtered by caches. Meanwhile, region coherence has been proposed for CPU-only systems to reduce snoop bandwidth by obtaining coherence permissions for large regions.
This paper develops Heterogeneous System Coherence (HSC) for CPU-GPU systems to mitigate the coherence bandwidth effects of GPU memory requests. HSC replaces a standard directory with a region directory and adds a region buffer to the L2 cache. These structures allow the system to move bandwidth from the coherence network to the high-bandwidth direct-access bus without sacrificing coherence.
Evaluation results with a subset of Rodinia benchmarks and the AMD APP SDK show that HSC can improve performance compared to a conventional directory protocol by an average of more than 2× and a maximum of more than 4.5×. Additionally, HSC reduces the bandwidth to the directory by an average of 94% and by more than 99% for four of the analyzed benchmarks.