Abstract:
This paper investigates partitioning the ways of a shared last-level cache among the threads of a symmetric data-parallel application running on a chip-multiprocessor. Unlike prior work on way-partitioning, which targets unrelated threads in multiprogrammed workloads, the domain of multithreaded programs requires both throughput and fairness. Additionally, our workloads show no obvious thread differences to exploit: program threads see nearly identical IPC and data reuse as they progress (as expected for a well-written, load-balanced data-parallel program).
Despite the balance and symmetry among threads, this paper shows that a balanced partitioning of cache ways between threads is suboptimal. Instead, this paper proposes a strategy of temporarily imbalancing the partitions between threads to improve cache utilization, adapting to the locality behavior of the threads as captured by dynamic set-specific reuse distance (SSRD). Cumulative SSRD histograms have knees that correspond to distinct important working sets; thus, cache ways can be taken from a thread with only minimal performance impact if that thread is currently operating far from a knee. Those ways can then be given to a single "preferred" thread to push it over its next knee. The preferred thread is chosen in round-robin fashion to ensure balanced progress over the execution. The algorithm also effectively handles scenarios where an unpartitioned cache would outperform any explicit partitioning. This dynamic partition-imbalance algorithm reduces execution time by up to 44% and misses by up to 91% over an unpartitioned shared cache for 9 benchmarks from the PARSEC-2.0 and SPEC OMP suites.
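The core reallocation idea described above can be sketched as follows. This is an illustrative sketch only, not the paper's exact algorithm: it assumes each thread's cumulative SSRD histogram has already been reduced to a sorted list of knee positions (way counts), and the hypothetical `rebalance` function donates ways from threads sitting comfortably above a knee to the current preferred thread, pushing it toward its next knee while keeping the total way count fixed.

```python
# Hypothetical sketch of dynamic partition imbalance, assuming knees have
# been extracted from each thread's cumulative SSRD histogram.

def rebalance(alloc, knees, preferred):
    """alloc[t]  = ways currently held by thread t
       knees[t]  = sorted knee positions (way counts) for thread t
       preferred = index of the round-robin preferred thread
       Returns a new allocation with the same total number of ways."""
    alloc = list(alloc)
    # Next knee the preferred thread has not yet reached; if none,
    # leave its allocation unchanged.
    target = next((k for k in knees[preferred] if k > alloc[preferred]),
                  alloc[preferred])
    need = target - alloc[preferred]
    for t in range(len(alloc)):
        if t == preferred or need == 0:
            continue
        # Largest knee at or below t's current allocation: ways above
        # that knee contribute little, so they can be donated cheaply.
        floor = max((k for k in knees[t] if k <= alloc[t]), default=1)
        give = min(alloc[t] - floor, need)
        alloc[t] -= give
        need -= give
    alloc[preferred] = target - need
    return alloc
```

For example, with three threads holding [5, 5, 6] ways and knees [[2, 8], [3], [4]], preferring thread 0 moves it to 8 ways by reclaiming ways the other threads hold beyond their last knee, while the total stays constant.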