

VLSI architecture, synthesis & technology



# **Democratizing Customized Computing**

Jason Cong

Volgenau Chair for Engineering Excellence, UCLA Computer Science Director, Center for Domain-Specific Computing (CDSC) <u>https://vast.cs.ucla.edu/people/faculty/jason-cong</u>

## **From Parallization to Customization**



Original source: Shekhar Borkar, Intel

... to look beyond parallelization and focus on domain-specific customization to bring significant power-performance efficiency ...

10/5/22

#### Customized Computing has been Our Research Focus Since 2009



#### Synthesis Lectures on **Computer Architectures**, 2015

#### INVITED PAPER

#### Customizable Computing— From Single Chip to Datacenters

By JASON CONG<sup>0</sup>, Fellow IEEE, ZHENMAN FANG<sup>0</sup>, Member IEEE, MUHUAN HUANG, PENG WEI, DI WU, AND CODY HAO YU

providing an elegian sourcom wine to computing the terms of ter this is better suited for energy-constrained designs where the silicon resource is abundant and spatial computing is favored—which has been the case with the end of Dennard I. INTRODUCTION scaling. Currently, customizable computing has garnered great Since the introduction of the microprocessor in 1971,

ABSTRACT | Since its establishment in 2009, the Center important research dimension enables automation for cus for Domain-Specific Computing (CDSC) has focused on tomized computing. This includes automated compilation for ing systems will be customizable with extensive use of the site of accelerators, as custom-designed accelerators often provide 10-100X performance/energy efficiency over the general-transparent resource management for integration of FPGAs for purpose processors. Such an accelerator-rich architecture datacenter-scale acceleration with support to the existing propresents a fundamental departure from the classical von Neu- gramming interfaces, such as MapReduce, Hadoop, and Spark mann architecture, which emphasizes efficient sharing of the for large-scale distributed computation. We will present the executions of different instructions on a common pipeline, latest progress in these areas, and also discuss the challenges providing an elegant solution when the computing resource and opportunities ahead.

interest: for example, this is evident by Intel's 117 billion the improvement of processor performance in its first acquisition of Altera in 2015 and Amazon's introduction of field 30 years was largely driven by the Dennard scaling of programmable gate-arrays (FPGAs) in its AWS public cloud. transistors [1]. This scaling calls for reduction of transistor In this paper, we present an overview of the research pro-grams and accomplishments of CDSC on customizable com-years) while keeping electric fields constant everywhere puting, from single chip to server node and to datacenters, in the transistor to maintain reliability (which implies with extensive use of composable accelerators and FPGAs. that the supply voltage needs to be reduced by 30% as We highlight our successes in several application domains, well in each generation). Such scaling not only doubles we infinite our soccesses in several approaches ournams, we in reach generation, ouch saming not only doubles such as medical imaging, machine learning, and computational the transistor density each generation and reduces the genomics. In addition to architecture innovations, an equal transistor delay by 30%, but also at the same time improves the nower by 50% and energy by 65% [2]. The increase

#### **Proceedings of IEEE, 2019**

IEEE Design & Test, 2011

# **Successful Examples of Customization**

#### • Example:

- Google TPU (Tensor Processing Unit)
- First version: 2014
- Revised TPU (2017), for training and inference
  - DRAM, 2 DDR3 -> GDDR5, 34GB/s -> 180GB/s
  - 200x perf/W of Haswell CPU, 70x perf/W of K80 GPU Based on data in [ISCA2017]
- Limitations:
  - Too costly for individuals (or small companies) to design
  - Take too much time to build



|  |                          | for L | Jnified Buffe<br>ocal Activati<br>256x8b = 24<br>29% of chip | Matrix Multiply Unit<br>(256x256x8b=64K MAC)<br>24% |                                    |                   |
|--|--------------------------|-------|--------------------------------------------------------------|-----------------------------------------------------|------------------------------------|-------------------|
|  | D<br>R<br>A<br>M<br>port |       | Host<br>Interf. 2%                                           |                                                     | Accumulators<br>56x32b = 4 MiB) 6% | D<br>R<br>A       |
|  |                          |       | Control 2%                                                   |                                                     | tion Pipeline 6%                   | M<br>port<br>ddr3 |
|  | ddr3<br>3%               |       |                                                              | %                                                   | Misc. I/O 1%                       | aar3<br>3%        |

331 mm2 ( < CPU: Haswell 18 core, 662 mm2 ) 75 TDP Watts ( < CPU: 145) 91.8 Peak TOPS/chip (CPU: 2.6 TOPS)

Google TPU: In-Datacenter Performance Analysis of a Tensor Processing Unit, ISCA 2017

#### **Customized Computing on FPGAs:** Example: Scalable Sorting [ISCA 2020]

- Bonsai: Adaptive merge tree sort solution (compute and I/O optimal) ٠
  - Optimized configuration of merge sort kernel for different memory configurations
  - Best DRAM-scale sorting performance
  - Scale to TB sorting via reconfiguration



Use of FPGAs trade-off performance for design cost, flexibility, and time-to-silicon 10/5/22

5

# Power of Customization (Domain-Specific Accelerators)

- Special Data Types and Operations
  - Do in 1 cycle what normally takes 10s or 100s 10-1000x efficiency gain

#### Most significant on ASIC (if one can afford cost and time) Still very substantial speedup on FPGAs despite its overhead

#### DSAs gain efficiency from specialization and performance from parallelism.

BY WILLIAM J. DALLY, YATISH TURAKHIA, AND SONG HAN

#### Domain-Specific Hardware Accelerators

FROM THE SIMPLE embedded processor in your washing machine to powerful processors in data center servers. most computing today takes place on general-purpose programmable processors or CPUs. CPUs are attractive because they are easy to program and because large code basesexist for them. The programmability of CPUs stems from their execution of sequences of simple instructions, such as ADD or BRANCH; however, the energy required to fetch and interpret an instruction is 10× to 4000× more than that required to perform a simple operation such as ADD. This high overhead was acceptable when processor performance and efficiency were scaling according to Moore's Law.32 One could simply wait and an existing application would run faster and more efficiently. Our economy has become dependent on these increases in computing performance and efficiency to enable new features and new applications. Today, Moore's Law has largely ended,<sup>12</sup> and we must

CACM July 2020

48 COMMUNICATIONS OF THE ACM | JULY 2020 | VOL. 63 | NO. 7

look to alternative architectures with lower overhead, such as domain-specific accelerators, to continue scaling of performance and efficiency. There are several ways to realize domain-specific accelerators as discussed in the sidebar on accelerator options.

A domain-specific accelerator is a hardware computing engine that is specialized for a particular domain of applications. Accelerators have been designed for graphics,<sup>26</sup> deep learn-ing,<sup>16</sup> simulation,<sup>2</sup> bioinformatics,<sup>49</sup> image processing,38 and many other tasks. Accelerators can offer orders of magnitude improvements in performance/cost and performance/W com pared to general-purpose computers. For example, our bioinformatics accelerator, Darwin,\* is up to 15,000× faster than a CPU at reference-based, long-read assembly. The performance and effi-ciency of accelerators is due to a combination of specialized operations, parallelism, efficient memory systems and reduction of overhead. Domain specific accelerators7 are becoming more pervasive and more visible, because they are one of the few remaining ways to con tinue to improve performance and effi ciency now that Moore's Lawhas ended.<sup>22</sup> Most applications require modifi-

cations to achieve high speed up on

#### » key insights

 Most speedup comes from parallelisn enabled by specialization—the main source of efficiency.

 The underlying algorithms often have to change—trading increased hardwarefriendly computation for reduced memory bandwidth demands.

Accelerator design is really parallel programming guided by a cost model arithmetic is free and global memory is expensive.

 Memory typically dominates both area and power of domain-specific accelerators.

 Specialized instructions give much of the advantage of a DSA at a fraction of the development cost and while retaining programmability.

Domain-specific accelerators are one of the few ways to continue scaling the performance and efficiency of computing hardware.

10/5/22

#### Question: Can Every Programmer Easily Design DSAs?

#### Or Can Every <u>Serious</u> Programmer Easily Design DSAs?

#### **Current Answer: Yes and No**

10/5/22

#### It's Natural to Think about High-Level Synthesis (HLS)?



#### Significant progress in the past decade

- Example: xPilot (UCLA 2006) -> AutoPilot (AutoESL)
   -> Vivado HLS (Xilinx 2011-)
  - Platform-based C to RTL synthesis
  - Synthesize pure ANSI-C and C++, GCC-compatible compilation flow leveraging LLVM framework
  - Full support of IEEE-754 floating point data types & operations
  - Efficiently handle bit-accurate fixed-point arithmetic
  - SDC-based scheduling
  - Automatic memory partitioning
- ...

QoR matches or exceeds manual RTL for many designs

TCAD April 2011 (keynote paper) "High-Level Synthesis for FPGAs: From Prototyping to Deployment"

#### Good News: Not Difficult to Create Circuits from C/C++ Using HLS





#### Challenge 1: Synthesized Circuit May Not Have Good Performance

• Not surprising if you have done multi-core programming – the same problem!

122x speedup

• Need to add pragmas (microarchitecture hints).



When targeting FPGA, 13x slower than running on a single-core CPU

After proper pragma insertions

Based on the Merlin Compiler, open-sourced by AMD/Xilinx

10

#### Challenge 2: # Possibilities for Pragmas Insertion Can Be Very Large!



Search space by AutoDSE

#### **Overview of Our Approach**



**Goal: "Democratize" accelerator designs for customized computing** 

#### Example of Architecture-Guided Optimization: AutoSA

Wang, Jie, Licheng Guo, and Jason Cong. "AutoSA: A Polyhedral Compiler for High-Performance Systolic Arrays on FPGA." FPGA'2021



# Systolic Array Advantages







#### Many Accelerators are Based on Systolic Arrays



Google TPU



**Tesla Self-Driving Chip** 



**Amazon Infrentia** 

### Systolic Array Design Stories from Industry





# **Overview of AutoSA Compilation Flow**



- Extract polyhedral model from the source code.
- Examine if the target program can be mapped to systolic array.
- Construct and optimize PE arrays
  - Space-time mapping, array partitioning, latency hiding, vectorization
- Construct and optimize I/O network
  - O I/O network analysis, double buffering, data-packing
- Generate target hardware code

#### Challenge: Large Design Space & Many Optimization Opportunities







Dataflows types(6) X Dataflow Configurations(O(2^40))

• Space-time mapping: transforming the program to a systolic array with space-time mapping.



• Array partitioning: partitioning the array into smaller sub-arrays to fit limited on-chip resource.





• Latency hiding: permuting parallel loops inside to hide computation latency.

| Latency Hiding                                                                                                                                                                                                                                                             |
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| <pre>for (int i.0 = 0; i.0 &lt; I/T_I1; i.0++) for (int j.0 = 0; j.0 &lt; J/T_J1; j.0++) for (int k.0 = 0; k.0 &lt; K/T_K1; k.0++) for (int i.1 = 0; i.1 &lt; T_I1/T_I2; i.1++) for (int j.1 = 0; j.1 &lt; T_J1/T_J2; j.1++) for (int k.1 = 0; k.1 &lt; T_K1; k.1++)</pre> |
| <pre>for (int i.2 = 0; i.2 &lt; T_I2; i.2++) for (int j.2 = 0; j.2 &lt; T_J2; j.2++) C[] += A[] * B[]; Latency hiding loops</pre>                                                                                                                                          |

• SIMD vectorization: vectorizing computation to amortize the PE control overheads.

| SIMD V | ectorization                                                                           |           |
|--------|----------------------------------------------------------------------------------------|-----------|
|        | <pre>c (int i.0 = 0; i.0 &lt; I/T_I1; i.0++)</pre>                                     |           |
|        | for (int j.0 = 0; j.0 < J/T_J1; j.0++)<br>for (int k.0 = 0; k.0 < K/T_K1; k.0++)       |           |
|        | for (int i.1 = 0; i.1 < T_I1/T_I2; i.1++)                                              |           |
|        | for (int j.1 = 0; j.1 < T_J1/T_J2; j.1++)<br>for (int k.1 = 0; k.1 < T_K1/T_K2; k.1++) |           |
|        | for (int i.2 = 0; i.2 < $T_12$ ; i.2++)                                                |           |
|        | for (int j.2 = 0; j.2 < T_J2; j.2++)<br>for (int k.2 = 0; k.2 < T_K2;k.2++)            | SIMD loop |
|        | C[] += A[] * B[];                                                                      |           |

# What about Data Communication?

• Polyhedral model supports precise data dependence analysis.



| Dependence | Dependence Type | Array Access | Dependence Distance        | I/O Group |
|------------|-----------------|--------------|----------------------------|-----------|
| D1         | Read (RAR)      | A[i][k]      | ( <b>0</b> , <b>1</b> , 0) | g1        |
| D2         | Read (RAR)      | B[k][j]      | ( <b>1</b> , <b>0</b> , 0) | g2        |
| D3         | Flow (RAW)      | C[i][j]      | ( <b>0</b> , <b>0</b> , 1) | g3        |
| D4         | Output (WAW)    | C[i][j]      | ( <b>0</b> , <b>0</b> , 1) | g4        |

We omit the statement of array initialization for brevity.

10/5/22

Example: I/O network generation based on the polyhedral model

### **Use Dependency to Construct Communication Network**

• Polyhedral model supports precise data dependence analysis.



| Dependence | Dependence Type | Array Access | Dependence Distance        | I/O Group |
|------------|-----------------|--------------|----------------------------|-----------|
| D1         | Read (RAR)      | A[i][k]      | <b>(0, 1, 0)</b>           | g1        |
| D2         | Read (RAR)      | B[k][j]      | ( <b>1</b> , <b>0</b> , 0) | g2        |
| D3         | Flow (RAW)      | C[i][j]      | ( <b>0</b> , <b>0</b> , 1) | g3        |
| D4         | Output (WAW)    | C[i][j]      | ( <b>0</b> , <b>0</b> , 1) | g4        |

We omit the statement of array initialization for brevity.

10/5/22

Example: I/O network generation based on the polyhedral model

24

# Auto-Tuning in AutoSA (More in Late Slides)

Generality



#### **Search Time**

# **Benchmark Examples and Productivity Gain**

| Application              | Problem Size                                                     | #Statements | Input C LOC | Output HLS LOC |
|--------------------------|------------------------------------------------------------------|-------------|-------------|----------------|
| Matrix<br>Multiplication | [ <i>i</i> , <i>j</i> , <i>k</i> ]: [1024,1024,1024]             | 2           | 7           | 9265           |
| CNN                      | [i, o, h, w, p, q]: $[512, 512, 56, 56, 3, 3]$                   | 2           | 10          | 9861           |
| MTTKRP                   | [ <i>i</i> , <i>k</i> , <i>l</i> , <i>j</i> ]: [512,512,512,512] | 2           | 9           | 7858           |
| TTMc                     | [i, j, k, l, m]: [128,128,128,128,128]                           | 2           | 9           | 7637           |
| LU<br>Decomposition      | [n]: [12/16/20/24]                                               | 9           | 27          | 1316           |

Complex systolic Array from C-to-Silicon in a day! Recall that common industry practice requires 4-18 months.

### Performance

| Benchmark |                          | Platform             | Array Sizes | Data Type | GFLOPs | MHz | DSP<br>Efficiency |
|-----------|--------------------------|----------------------|-------------|-----------|--------|-----|-------------------|
| CNIN      | Wei et al.<br>'17        | Intel Arria<br>10    | 8x19x8      | FP32      | 602.8  | 253 | 97%               |
| CNN       | AutoSA                   | Xilinx Alveo<br>U250 | 16x14x8     | FP32      | 950.2  | 272 | 97%               |
| MTTKRP    | Srivastava<br>et al. '19 | Intel Arria<br>10    | 8x9x16      | FP32      | 700    | 204 | 99%               |
| WITTKKP   | AutoSA                   | Xilinx Alveo<br>U250 | 16x8x8      | FP32      | 896.7  | 296 | 99%               |
| TTMc      | Srivastava<br>et al. '19 | Intel Arria<br>10    | 8x10x16     | FP32      | 738    | 205 | 94%               |
| TTVIC     | AutoSA                   | Xilinx Alveo<br>U250 | 16x8x8      | FP32      | 886.2  | 290 | 99%               |

# **AutoSA is Open-Sourced**

- Github: <u>https://github.com/UCLA-VAST/AutoSA</u>
- Document: <a href="https://autosa.readthedocs.io/en/latest/">https://autosa.readthedocs.io/en/latest/</a>



## Some Architecture Insights from AutoSA

- Example: 1024x1024x1024 GEMM
- Common wisdom: dimensions of a systolic array be divisors of the problem size.
  - Timeloop ISPASS '19 (MIT, Nvidia, Stanford),
  - dMazeRunner TECS '19 (Ariazon State Univ., Yonsei Univ., Intel)
  - Interstellar ASPLOS '20 (Stanford, Tsinghua)
- Non-divisor solution can be 50% faster





# Some More Architecture Insight using AutoSA

- Common wisdom: Minimize off-chip communication. E.g.
  - Marvel Arxiv '20 (Georgia Tech, Nvidia),
  - Chen et al. HPCA '20 (UCAS, Tsinghua Univ.)
- Again, not necessarily!

| SA Size<br>(Cols,Rows, SIMD) | Minimization<br>Goal | DSPs | Frequency | Throughput     | DRAM<br>Traffic | CTC<br>(FLOP/byte) | Effective<br>Bandwidth |
|------------------------------|----------------------|------|-----------|----------------|-----------------|--------------------|------------------------|
| 32x4x8                       | DRAM Traffic         | 5120 | 282 MHz   | 496.16 GFLOP/s | 16.7 MB         | 128                | 4.3 GB/s               |
| 16x13x8                      | Latency              | 8320 | 243 MHz   | 764.46 GFLOP/s | 80.3 MB         | 26.7               | 36.5 GB/s              |

# What about General C/C++ Programs?

• Adopting the Merlin Compiler, developed by Falcon Computing (acquired by Xilinx in 2020 and open-sourced in 2021)

#### #pragma ACCEL parallel

- Run multiple loop iterations in parallel (instruction/task-level)

#### #pragma ACCEL pipeline

- Run multiple loop iterations in pipeline (instruction/task-level)

OpenMP for multi-core CPUs

Merlin for FPGAs



#### Automated code transformation and transformation

- On-chip memory banking/partitioning/delinearization
- o External memory bursting/streaming/coalescing
- Host interface and host code generation (In OpenCL)

10/5/22 Advanced options for parallel: reduction and stencil variables

#### AutoDSE: Bottleneck-based Optimizer [TODAES'22]



# **Evaluation on Xilinx Vitis Library**

- Tested on 33 kernels, each has 13.5 HLS optimization pragmas on average,
  - AutoDSE achieves roughly the same performance (1.04x higher)
  - Eliminated all HLS or Merlin optimization pragmas
- Both Merlin and AutoDSE keep and propagate dataflow and streaming pragmas
  - Will rely on dataflow composition using TAPA (later)



10/5/22

34

#### Current Goal: More Extensive DSE Using Deep Graph Learning

#### • Review of the problem

#### Manual code

- MVT kernel from Polybench
  - Two matrix-vetor multiplications

#### Solution space

• > 3M design choices



#### **Step 1**: Create a Database for Training the Model



# Step 2: Represent the Program as a Graph

- Build the graph using the LLVM IR to capture lower-level instructions, i.e. closer to hardware
- Need to include both the program semantic and pragma flow in the graph
  - Program semantic: control, data, and call flow
    - Adapting the latest representation proposed for including these information (ProGraML [ICML'21])



foo

• The graph is generated once per kernel and filled with different pragma values later on

### **Step 3: Build a Predictive Model**

• GNN-based model:



# **Design Space Exploration in GNN-DSE**

- The trained model is replaced with the HLS tool for evaluating the design points
- The top M design points are evaluated with the HLS tool and added to the training database for subsequent trainings



# **Experimental Results**

#### • Model's performance

• Regression loss is in RMSE

| Model | Method                                      | Speedup | DSP  | LUT  | FF   | BRAM | All  | Accuracy | F1-score |
|-------|---------------------------------------------|---------|------|------|------|------|------|----------|----------|
| M1    | MLP-pragma (based on Kown, et al. MLCAD'20) | 3.28    | 0.59 | 0.31 | 0.25 | 0.34 | 4.76 | 0.52     | 0.42     |
| M2    | M1 + program context                        | 2.94    | 0.47 | 0.24 | 0.13 | 0.16 | 3.94 | 0.78     | 0.40     |
| M3    | GNN-DSE                                     | 0.56    | 0.13 | 0.08 | 0.06 | 0.05 | 0.85 | 0.93     | 0.87     |

- Keep augmenting database until design space exploration (DSE) matches the best designs
  - Initial database:
    - 4428 total configs / 1036 valid configs
  - Final database:
    - 4752 total configs / 1278 valid configs
- More training examples lead to better accuracy



# **Experimental Results on Unseen Kernels**

- DSE results on new kernels which were not in the database
  - All new kernels dealing with matrix vector operations
    - But with different coding styles, input sizes, and loop trip counts from our database
- Baseline: AutoDSE after 21 h
- GNN-DSE could achieve about the same performance
  - From -2% and +5% difference with a mean of +1%
  - With a maximum DSE time of 1 hour
- Adapting to domain shift in "Improving GNN-Based Accelerator Design Automation with Meta Learning [DAC'22]"

| Kernel  | # pragma | # Design<br>configs | DSE + HLS<br>Runtime (mins) | # Explored | Runtime<br>Speedup |
|---------|----------|---------------------|-----------------------------|------------|--------------------|
| bicg    | 5        | 3,536               | 18                          | 3,536      | 69x                |
| doitgen | 6        | 179                 | 16                          | 179        | 11x                |
| gesummv | 4        | 1,581               | 16                          | 1,581      | 79x                |
| 2mm     | 14       | 492,787,501         | 74                          | 78,676     | 17x                |

# **Current Limitation of GNN-DSE – Domain Shift**

#### • Experimental evidence

- Trained on a suite of 9 kernels
- Tested on 5 different kernels with only 20 labeled designs for each of the 5 new kernel
- Root mean square error (RMSE) on the hold-out test set of each new kernel

|         | jacobi-1d | fdtd-2d | gemm   | 3mm    | gemver |
|---------|-----------|---------|--------|--------|--------|
| GNN-DSE | 4.2496    | 6.7047  | 7.5337 | 9.1584 | 4.4717 |

• DSE speedup with respect to AutoDSE after 20 hours

|         | jacobi-1d | fdtd-2d | gemm  | 3mm   | gemver |
|---------|-----------|---------|-------|-------|--------|
| GNN-DSE | 0.44x     | 0.06x   | 0.87x | 0.30x | 0.20x  |

• Accuracy drops when the testing kernels differ a lot from the training ones (domain shift), causing unsatisfactory DSE results. Meanwhile, our goal is to design a method that works well on any real-world kernel.

# Proposal: Use Transfer Learning (GNN-DSE-MAML)



Model-Agnostic Meta-Learning (MAML) -- Finn et al. 2017.

10/5/22

43



### **GNN-DSE (top) vs GNN-DSE-MAML (bottom)**

# Inspiration: K-shot Image Classification Using Meta-Learning

- Meta-learning:
  - Compute a model that can eventually generalize across many tasks
  - with good data and computation efficiency:
- Example:
  - *K*-shot image classification task:
  - learn a classification model that can quickly adapt to a new class with only *K* images from that class

#### Training task 1

Support set

<=2





N=3

Query set



#### Training task 2 $\cdot \cdot \cdot$

Support set



K. 🗶 🥶

#### Query set



#### Test task 1 · · ·

Support set



Query set



# **MAML** for Training

Algorithm 1 Training procedure of GNN-DSE-MAML

**Require:**  $p(\mathcal{P}^{(train)})$ : distribution over kernels (programs) for training **Require:**  $\alpha$ ,  $\beta$ : step size hyperparameters 1: randomly initialize  $\theta$ 2: while not done do Sample batch of kernels  $\mathcal{P}_i \sim p(\mathcal{P}^{(train)})$ 3: for all  $\mathcal{P}_i$  do 4: Sample *K* datapoints  $\mathcal{D} = \{X_i, Y_i\}$  from  $\mathcal{P}_i$ 5: Evaluate  $\nabla_{\theta} \mathcal{L}_{\mathcal{P}_i}(f_{\theta})$  using  $\mathcal{D}$  and  $\mathcal{L}_{\mathcal{P}_i}$  in Equation 1 6: Compute adapted parameters with gradient descent:  $\theta'_{i}$  = 7:  $\theta - \alpha \nabla_{\theta} \mathcal{L}_{\mathcal{P}_i}(f_{\theta})$ Sample datapoints  $\mathcal{D}'_i = \{\mathbf{x}^{(j)}, \mathbf{y}^{(j)}\}$  from  $\mathcal{P}_i$  for the meta-8: update end for 9: Update  $\theta \leftarrow \theta - \beta \nabla_{\theta} \sum_{\mathcal{P}_i \sim \mathcal{P}(\mathcal{P})} \mathcal{L}_{\mathcal{P}_i}(f_{\theta'_i})$  using each  $\mathcal{D}'_i$  and 10:  $\mathcal{L}_{\varphi_i}$  in Equation 1 11: end while



# MAML for Adaptation

Algorithm 1 Training procedure of GNN-DSE-MAML



• $heta_3^*$ 

# **Experimental Results – Offline Testing**

- K=20 for adaption
- Adaptation is necessary for the unadapted model to obtain lower error
- FineTune: Naïve adaptation using the regular objective function
- Under 4 out of 5 kernels, MAML leads to a more accurate adapted model.

| Method           | jacobi-1d | fdtd-2d | gemm   | 3mm    | gemver |
|------------------|-----------|---------|--------|--------|--------|
|                  | 4.2496    | 6.7047  | 7.5337 | 9.1584 | 4.4717 |
| GNN-DSE-FINETUNE | 3.2611    | 4.0831  | 1.7342 | 6.2930 | 3.1600 |
| GNN-DSE-MAML     | 2.3898    | 2.4912  | 2.1116 | 5.9670 | 3.0303 |

# **Experimental Results – DSE**

- MAML-based adaptation achieves great performance for 3 new kernels
  - 3mm: >17 trillion design candidates that AutoDSE got to explore only 149 of them after 20 hours since it relies on the HLS tool for evaluating each candidate
    - GNN-DSE-MAML yields a significant speedup for 3mm compared to AutoDSE

| Method            | jacobi-1d     | fdtd-2d       | gemm          | 3mm            | gemver        |
|-------------------|---------------|---------------|---------------|----------------|---------------|
| Gnn-Dse-Unadapted | 0.44×         | 0.06×         | 0.87×         | 0.30×          | 0.20×         |
| GNN-DSE-FINETUNE  | 0.54×         | $0.04 \times$ | $0.18 \times$ | $1.00 \times$  | $0.22 \times$ |
| GNN-DSE-MAML      | $1.00 \times$ | ТО            | $1.21 \times$ | $64.52 \times$ | ТО            |
| TO: Timed Out     |               |               |               |                |               |

# **Experimental Results – DSE**

- For ftdt-2d and gemver, the MAML results lead to Timed Out
  - The MAML-based model uses high degree of parallelization for each section of the loop nests, overwhelming the HLS tool.
  - Such cases were not covered in the K sampled samples for adapting the model.

| Method                                | jacobi-1d     | fdtd-2d       | gemm          | 3mm            | gemver        |
|---------------------------------------|---------------|---------------|---------------|----------------|---------------|
| GNN-DSE-UNADAPTED                     | 0.44×         | 0.06×         | 0.87×         | 0.30×          | 0.20×         |
| Gnn-Dse-Unadapted<br>Gnn-Dse-FineTune | $0.54 \times$ | $0.04 \times$ | $0.18 \times$ | $1.00 \times$  | $0.22 \times$ |
| Gnn-Dse-Maml                          | $1.00 \times$ | ТО            | $1.21 \times$ | $64.52 \times$ | ТО            |
| TO: Timed Out                         |               |               |               |                |               |

# AutoDSE and GNN-DSE are Open-source

#### <u>https://github.com/UCLA-VAST/GNN-DSE</u>



Scan me!

<u>https://github.com/UCLA-VAST/AutoDSE</u>



Scan me!



### How to Integrate Different Approaches?

### HeteroCL Programming Infrastructure [FPGA'19]

- Inspired by Halide: Separate program specification and optimization (scheduling)
  - Flexible: Mixed declarative & imperative programming

10/5/22

- Portable: Clean decoupling of algorithm & hardware customizations
- Efficient: Mapping to high-performance spatial architecture templates



Open-source: https://github.com/cornell-zhang/heterocl

### HeteroCL in a Nutshell

| HeteroCL code<br>r = hcl.reduce_axis(0, 3) Declarative code<br>c = hcl.reduce_axis(0, 3) (based on TVM)<br>out = hcl.compute(N, N),<br>lambda y, x:<br>hcl.sum(image[x+r, y+c]*kernel[r, c],<br>axis=[r, c])) | Corresponding C code<br>for (int y = 0; y < N; y++)<br>for (int x = 0; x < N; x++)<br>for (int r = 0; r < 3; r++)<br>for (int c = 0; c < 3; c++)<br>out[x, y] += image[x+r, y+c] * kernel[r, c]<br>Unroll<br>inner loops |
|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Custom s = hcl.create_schedule()<br>Compute s[out].unroll([r,c])                                                                                                                                              | (%)<br>60<br>40<br>20<br>20                                                                                                                                                                                              |
| Customfor i in range(2, 8):Data Types.quantize([out], Fixed(i, i-2))                                                                                                                                          | 20<br>2 4 6 8<br>linebuffer                                                                                                                                                                                              |
| Custom linebuf = s[image].reuse_at(out, out.y)<br>Memory winbuf = s[linebuf].reuse_at(out, out.x)                                                                                                             |                                                                                                                                                                                                                          |
| YH. Lai, et al., <i>HeteroCL: A Multi-Paradigm Programming Infrastructure</i>                                                                                                                                 | image window buffer kernel out                                                                                                                                                                                           |

10/5/22 for Software-Defined Reconfigurable Computing, FPGA'2019 Best Paper Award

### HeteroCL: Mapping to Spatial Architecture Templates

#### • Systolic Array

#### # matrix multiply kernel

out = hcl.compute(N, N), lambda y, x: sum(A[x, k] \* B[k, y]), axis=k)

s[out].systolic()



#### Stencil Architecture

# jacobi kernel

```
out = hcl.compute(N, N),
lambda y, x:
(in[y,x-1]+ in[y-1,x] + in[y,x] + in[y,x+1] + in[y+1,x])/5))
```

s[out].stencil()



One More Question:

Now I am good at using (enhanced) HLS, how to deal with (low) clock frequency and (long) compilation time from downstream physical synthesis ?

- FPGAs are increasingly large
- Multiple dies integrated together
- High delay penalty for die-crossing
- Large IPs with pre-determined location



Xilinx Alveo Xilinx Alveo U250 U280

- FPGAs are increasingly large Die boundaries
- Multiple dies integrated together
- High delay penalty for die-crossing
- Large IPs with pre-determined location



Xilinx Alveo Xilinx Alveo U250 U280

- FPGAs are increasingly large Die boundaries
- Multiple dies integrated together
- High delay penalty for die-crossing
- Large IPs with pre-determined location

**DDR** controllers



- FPGAs are increasingly large Die boundaries
- Multiple dies integrated together
- High delay penalty for die-crossing
- Large IPs with pre-determined location

**DDR** controllers



- FPGAs are increasingly large Die boundaries
- Multiple dies integrated together
- High delay penalty for die-crossing
- Large IPs with pre-determined location
- HLS has limited consideration of those physical barriers

Peripheral IPs (e.g., **HBM** Controller PCle) Xilinx Alveo Xilinx Alveo U250 U280 61

**DDR** controllers

### AutoBridge [FPGA'21 Best Paper Award]

- Add extra pipeline stages to long interconnects
- Couples floorplanning with HLS pipelining ٠
- Global optimization to assure correctness ٠

[FPGA'22] Accelerating SSSP for Power-Law Graphs

Sparse-Matrix Dense-Matrix Multiplication/

Successful Application DR-2 DDR 3

systolic arrays on fpga

- Automate latency-insensitive design at the HLS level ٠
- Improve average frequency from 150 MHz to ٠ 297 MHz over 43 test cases.



### **Case Study**

• Gaussian Elimination, 8 configurations



- **Difference in Resource Utilization** 
  - LUT: -0.14% 0
  - FF: -0.04% 0
  - BRAM: -0.03% 0
  - DSP: +0.00% 0





Default

24x24

U280

AutoBridge

Comparison of the 24x24 Design on U250

### Latency-Insensitive Designs Helped Compile Time as Well!



RapidStream [FPGA '22 Best Paper Award]

### **Experimental Result**

- Tested on 6 large scale dataflow designs targeting Xilinx U250 FPGA with 4 SLRs (dies)
- Distribute to 4 Xeon servers, each with 56 cores
- Divide the FPGA into 32 islands (8 rows, 4 columns)
- 5-7X speedup (from C++ to fully routed checkpoint)
- Up to 1.3X frequency improvement





### Use Overlay for Even Faster Compilation: OverGen [MICRO'22]



### Composing Large Dataflow Designs Using TAPA

- TAPA programs explicitly decouple communication and computation
- Computation => compiled by Vitis HLS / AutoSA / AutoDSE / ...
- Communication => generated by TAPA



# **Example: FlexCNN Using TAPA**

- FlexCNN: an end-to-end automated DNN synthesis framework
- From ONNX to bitsream on FPGAs



# Concluding Remark 1

- I am encouraged by the progress/results on democratizing accelerator designs and customized computing
- It takes a community-wide effort
- All our tools are open-sourced, and FPGA vendors are more open as well
  - One-API from Intel
  - Merlin from AMD/Xilinx (after acquisition of Falcon Computing)
- Increasingly interested in using MLIR as an integration point

# Concluding Remark 2

- Important for the architecture community to have a rapid prototyping flow
  - From Idea to Silicon in days, not months/years
- Concerned with some accelerator evaluation methodology
  - "We evaluate XXX using a C++-based cycle-level simulator."
  - Does it consider
    - reduced memory bandwidth due to short burst length?
    - interconnect network size and latency from HBM ports to logic elements?
    - interconnect delays ...?
  - Has it been validated against any real silicon (FPGA or ASIC)?

# **Concluding Remark 3**

• I had the pleasure working with many collaborators in other application domains.







Alex Bui and William Hsu Low-dose CT reconstruction

Tad Blair Real-time neural signal processing



Yizhou Sun Graph similarity computation

- It's time to enable domain experts to design their own accelerators!
- The deep learning community has done a much better job "every" domain expert can train complex DL models
- Can we catch up? Think about broader impact!

# **Final Remark**

- No doubt we are in an exciting era for computer architecture
- We want to every (serious) software programmer to participate
  - Not just architects
- Build his/her own customized accelerators on field-programmable fabrics
  - On premise or in the cloud
- I hope that many of you can join this effort

# turing lecture

#### DOI:10.1145/3282307

Innovations like domain-specific hardware, enhanced security, open instruction sets, and agile chip development will lead the way.

BY JOHN L. HENNESSY AND DAVID A. PATTERSON

# A New Golden Age for Computer Architecture



2018 Turing Award Lecture

# A Story ...

#### WCC UPPER LEVEL HLS/cafe LUNCH + Full salad bar M – F. + Soups & chilis 11:30am-2:30pm + Hot stations (International. Plant Forward, American BBQ) Wide selection of diverse, + Deli delicious & inspired items + Pizza + Grill + Baked goods & pastries + Grab n' go salads & sandwiches $\mathcal{O} = \mathcal{O}$ + Grab n' go yogurt & fruits

- Q: Does everyone here do High-Level Synthesis?
- A: What do you mean? We are all from Harvard Law School.

# Acknowledgements: NSF, JUMP/CRISP, and CDSC Industrial Partners

• Multi-year efforts by many students, postdocs, and collaborators



Prof. Tony Nowatzki (UĆLA)



(UCLA)

Peng Wei

(UCLA)

Sihao Liu

(UCLA)







Prof. Zhiru Zhang (Cornell Univ.)



Prof. Peipei Zhou (Univ. of Pittsburgh)



Prof. Vivek Sarkar (Georgia Tech)





Zhengrong Wang (UCLA)



Yunsheng Bai

Yuze Chi (UCLA)



(UCLA)

Atefeh Sohrabizadeh

(UCLA)



Hao Yu (UCLA/Falcon)

Zhe Chen

(UCLA)



Yi-Hsiang Lai (Cornell)



Weikang Qiao (UCLA)



Licheng Guo

(UCLA)



Jian Weng (UCLA)



Jason Lau (UCLA)





Suhail Basalama (UCLA)



# Thank You!