Effective Instruction Scheduling Techniques for an Interleaved Cache
Clustered VLIW Processor
Authors:
Enric Gibert (UPC Barcelona)
Jesús Sánchez (UPC Barcelona + Intel Barcelona Research Center)
Antonio González (UPC Barcelona + Intel Barcelona Research Center)
Abstract:
Clustering is a common technique for overcoming the wire delay problem
caused by the evolution of technology. Fully-distributed
architectures, where the register file, the functional units and the
data cache are partitioned, are particularly effective at dealing with
these constraints and are also very scalable. This paper proposes
effective instruction scheduling techniques for a clustered VLIW
processor with a word-interleaved cache. Such scheduling
techniques rely on: (i) loop unrolling and variable alignment to
increase the percentage of local accesses, (ii) a latency assignment
process to schedule memory operations with an appropriate latency and
(iii) different heuristics to assign instructions to clusters. With
these techniques, the number of local accesses increases by more than
25% and the ratio of stall time to compute time is small.
Next, the main source of remote accesses and stall time is investigated.
Stall time is mainly due to remote hits, and Attraction Buffers are used
to increase local accesses and reduce stall time. Stall time is reduced
by 29% or 34%, depending on the scheduling heuristic. IPC results for a
clustered VLIW processor with a word-interleaved cache are similar to those of
the multiVLIW (a cache-coherent clustered processor with a more complex
hardware design), and are 10% or 5% better (depending on the scheduling
heuristic) than the IPC for a clustered processor with a unified cache.
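
Illustrative sketch (not taken from the paper): the role of loop unrolling and variable alignment in technique (i) can be shown with a small model of a word-interleaved cache. The mapping below, where consecutive words are assigned to consecutive clusters, as well as the word size, the cluster count and the function names, are assumptions made for illustration only.

```python
WORD_BYTES = 4      # assumed word size in bytes
NUM_CLUSTERS = 4    # assumed number of clusters


def home_cluster(address, word_bytes=WORD_BYTES, num_clusters=NUM_CLUSTERS):
    """Cluster whose cache module holds the word at `address`.

    Assumed mapping: consecutive words go to consecutive clusters.
    """
    return (address // word_bytes) % num_clusters


def clusters_per_unrolled_copy(base, n_iters, unroll, stride_bytes=WORD_BYTES):
    """Model a loop that accesses base + i * stride_bytes on iteration i.

    After unrolling by `unroll`, original iteration i is executed by static
    copy i % unroll. Returns, for each static copy, the set of clusters
    that copy touches over the whole loop.
    """
    touched = [set() for _ in range(unroll)]
    for i in range(n_iters):
        touched[i % unroll].add(home_cluster(base + i * stride_bytes))
    return touched


if __name__ == "__main__":
    # No unrolling: the single memory instruction touches every cluster.
    print(clusters_per_unrolled_copy(base=0, n_iters=16, unroll=1))
    # Unrolled by the cluster count with an aligned base: copy k touches
    # only cluster k, so it can be scheduled there as a local access.
    print(clusters_per_unrolled_copy(base=0, n_iters=16, unroll=4))
    # Same unrolling with a misaligned base: each copy still touches a
    # single cluster, but which one depends on the base address.
    print(clusters_per_unrolled_copy(base=8, n_iters=16, unroll=4))
```

In this model, unrolling by the number of clusters makes each static memory instruction access a single cluster, and aligning the variable makes that cluster known at compile time, so the scheduler can place the instruction where its accesses are local.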