## The RISC Journey from One to a Million Processors

Dave Ditzel, Founder and CTO, Esperanto Technologies, Inc.

dave@esperanto.ai

October 3, 2022 MICRO-55 Keynote

#### A Personal View of Computer History

Designing computers has been a lifelong passion

Fortunate to have been at the right place a number of times

- Iowa State University: SYMBOL High Level Language Computer & first 8-bit uP chips
- UC Berkeley: At transition from HLLCA to RISC, under Prof. D. Patterson
- Bell Labs: At the creation of Unix, C, C++, early RISC processors (C-Machines)
- Sun Microsystems: At transition from M68K to RISC (SPARC)
- Transmeta: Binary translation onto VLIW-RISC
- Intel: Binary translation onto OOO-RISC
- Esperanto: Massively-parallel energy-efficient RISC-V processors

Got to see the evolution of early architectures from 8-bit to 64-bit superscalar monsters

One trend in design seems to have withstood the test of time more than any other: RISC

Useful quote for computer architects

## "Those who cannot remember the past are condemned to repeat it."

George Santayana, The Life of Reason, 1905.

### Computer Development Timeline of Notable Designs

| 1950 | 1960 | 1970 | 1980 | 1990 | 2000 | 2010 | 2020 | 2030 |
|------|------|------|------|------|------|------|------|------|
| 1<sup>st</sup><br>Transistor<br>Computer | DEC PDP-1<br>IBM System/360 | DEC VAX<br>Xerox Alto | | DEC Alpha | | | | |
| | | IBM 801<br>Bell Labs C-Machines | The Case for RISC<br>UCB RISC-I,II<br>Stanford MIPS<br>Bell Labs CRISP/Hobbit | SPARC<br>MIPS Co. | | RISC-V ISA started | RISC-V International | |
| | CISC Era: 20 years | | RISC Era: approaching 50 years | | | | | |

#### 1978 Photo: My career started at the same time as the first microprocessors



My personal hobby computer

- 8-bit 6502 Microprocessor
- 4 MHz
- 4K Bytes of main memory
- Hand "wire-wrapped"
- Paper tape reader
- Hand wound transformer on power supply

Note: base of SYMBOL computer in background

### Era of High Level Language Computer Architecture

later called CISC (Complex Instruction-set Computers)

### DEC VAX-11/780: a 32-bit CISC

A real 70's architecture:

- VAX: Transitioned from 16-bit PDP-11 assembly to 32-bit addresses and compiled code
- ISA made it easy to take high level language statements and cast into VAX assembly code

Instruction set heavily influenced by availability of 8-bit wide integrated circuits

VAX ISA composed of variable length instructions

- Instruction opcode (1 or 2 bytes) followed by up to 6 operands
- First operand descriptor (1 byte)
- First operand data (0-4 bytes)
- Second operand descriptor
- Second operand data
- Third operand descriptor
- Third operand data (0-4 bytes)
- ...

This was great when decoding serially one byte at a time, but a nightmare for pipelined or later superscalar implementations
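A minimal sketch of why serial decoding is forced by such a format. The encoding below is invented for illustration and is not the real VAX format: the opcode table, the descriptor layout, and all function names are assumptions. It shows that the start of each instruction is only known after every operand of the previous one has been walked, while a fixed-width format gives all start addresses up front.

```python
# Toy encoding, invented for illustration (NOT the real VAX format):
#   byte 0      : opcode; OPERAND_COUNT says how many operands follow
#   per operand : 1 descriptor byte whose low 3 bits give the number
#                 of data bytes (0-4) that follow it
OPERAND_COUNT = {0x01: 2, 0x02: 3, 0x03: 0}  # hypothetical opcodes


def variable_length(code, pc):
    """Return the length in bytes of the instruction starting at pc."""
    n_operands = OPERAND_COUNT[code[pc]]
    offset = pc + 1
    for _ in range(n_operands):
        data_bytes = code[offset] & 0x07  # the descriptor must be read first
        offset += 1 + data_bytes          # only now can the next one be located
    return offset - pc


def instruction_starts_variable(code):
    """Each start address depends on fully decoding the previous instruction."""
    starts, pc = [], 0
    while pc < len(code):
        starts.append(pc)
        pc += variable_length(code, pc)
    return starts


def instruction_starts_fixed(code, width=4):
    """With fixed-width instructions every start address is known in advance."""
    return list(range(0, len(code), width))


program = bytes([
    0x01, 0x00, 0x04, 1, 2, 3, 4,     # 2 operands: 0 and 4 data bytes
    0x03,                             # no operands
    0x02, 0x01, 9, 0x02, 8, 7, 0x00,  # 3 operands: 1, 2 and 0 data bytes
])
print(instruction_starts_variable(program))
print(instruction_starts_fixed(bytes(12)))
```

A pipelined or superscalar front end wants to fetch several instructions per cycle, which is straightforward in the fixed-width case and requires a serial dependency chain in the variable-length case.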

#### Xerox Palo Alto Research Center: Alto



First Graphical User Interface with mouse

5.8 MHz microcoded CPU

User loadable instruction sets led to several HLLCA type bytecoded instruction sets for languages like BCPL, Smalltalk and MESA

Enabled exploration of highly tailored instruction sets

But raised the question

"Why not compile to lowest microcode level?"

# SYMBOL: The Ultimate CISC

### SYMBOL: The ultimate Complex Instruction Set Computer

SYMBOL was a High Level Language Computer announced in 1971 by Fairchild Semiconductor

Supported by Gordon Moore and Robert Noyce at Fairchild until they left to form Intel

Goal was to reduce cost of software by using hardware instead, literally "programmers cost too much".

Implemented Compiler, Text Editor and Operating System entirely in logic gates

- 2 gates or 1 Flip-Flop per 14-pin DIP package, no ROM
- 20 thousand chips took **several years** to debug at ISU
- Instruction set was bytecode mapped almost 1-1 from high level ALGOL/LISP like typeless language
- Tagged architecture with descriptors, and "logical memory" that was not linearly addressable.
- Five years of my life thinking about CISC vs RISC, i.e. better ways to use 20K chips, programming this machine.

I got to work on this computer while a student for 5 years, after it was donated to ISU

Lessons:

- Don't build your compiler or OS in logic gates
- Use right combination of Hardware, Software or μCode

### Debugging SYMBOL, the inspiration for RISC





Inside red circle are 220 chips.

In order to fix a bug, white wires added on top of blue pc board to make logic changes.

Many of the 100 boards were covered with white bug fixes.



A Symbol "terminal" could consist of up to 99 physical I/O devices. The terminal shown at left used a modified IBM Selectric typewriter, editing keyboard, status display, card reader, and line printer. The book next to the typewriter contains operating system and utility source listings. The photo above, a side view of the Symbol mainframe, shows the maintenance processor (far left), power supplies for +4.5 volts at 1000 amps (below the mainframe), and disk memory, core memory, and paging drum (background). The front view of the mainframe (photo at right) shows a printed circuit card on an extender for testing. Each card held 200 ICs. Wing panels indicate the value of bus signals: 100 on the left, 100 on the right, and 50 on top. Additional wire on the PCB was used for "bug" fixes. Each processor could be monitored from a "processor active" lamp on the system's control panel.

Source: D. Ditzel, Reflections on the High Level Language SYMBOL Computer Systems, IEEE Computer, July 1981. Dave Ditzel – MICRO-55 Keynote -- From One to a Million RISC Processors

#### **HLLCA** Lessons

Putting functions in hardware does not necessarily make them higher performance

Putting functions in hardware always makes them harder to debug, modify and fix

High level opcodes often precluded the ability to make code generation optimizations

Example: SYMBOL precluded any use of pointer arithmetic, and array accesses were quite slow.

Better to provide a low level instruction set with HLL Compilers

Use logic gates to make that simple instruction set go very fast
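One way to picture the lost optimization, as a sketch under stated assumptions: if every array access is an indivisible high-level opcode, the element address is recomputed from the index each time. With low-level instructions a compiler can apply operator strength reduction and keep a running pointer instead. The cost model (counting one multiply and one add per address) and the function names are hypothetical.

```python
def addresses_indexed(base, elem_size, n):
    """Address of each element computed from scratch: one multiply plus one add."""
    ops = {"mul": 0, "add": 0}
    addrs = []
    for i in range(n):
        addrs.append(base + i * elem_size)
        ops["mul"] += 1
        ops["add"] += 1
    return addrs, ops


def addresses_pointer(base, elem_size, n):
    """Strength-reduced form: keep a running pointer, one add per element."""
    ops = {"mul": 0, "add": 0}
    addrs = []
    ptr = base
    for _ in range(n):
        addrs.append(ptr)
        ptr += elem_size
        ops["add"] += 1
    return addrs, ops


indexed, indexed_ops = addresses_indexed(1000, 4, 5)
pointer, pointer_ops = addresses_pointer(1000, 4, 5)
print(indexed == pointer, indexed_ops, pointer_ops)
```

Both versions visit the same addresses; only the second is expressible when the instruction set exposes pointer arithmetic to the compiler.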

Fascination studying SYMBOL (Ditzel/Patterson) eventually led both to religious conversion to RISC

#### Mood was changing against HLLCA by end of 70's

#### **Retrospective on High-Level Language Computer Architecture**

#### David R. Ditzel

Bell Laboratories Computing Science Research Center Murray Hill, New Jersey

David A. Patterson

Computer Science Division Department of Electrical Engineering and Computer Sciences University of California Berkeley, California

#### Introduction

High-level language computers (HLLC) have attracted interest in the architectural and programming community during the last 15 years; proposals have been made for machines directed towards the execution of various languages such as ALGOL,<sup>1,2</sup> APL,<sup>3,4,5</sup> BASIC,<sup>6,7</sup> COBOL,<sup>8,9</sup> FORTRAN,<sup>10,11</sup> LISP,<sup>12,13</sup> PASCAL,<sup>14</sup> PL/I,<sup>15,16,17</sup> SNOBOL,<sup>18,19</sup> and a host of specialized languages. Though numerous designs have been proposed, only a handful of high-level language computers have actually been implemented.<sup>4,7,9,20,21</sup> In examining the goals and successes of high-level language computers, the authors have found that most designs suffer from fundamental problems stemming from a misunderstanding of the issues involved in the design, use, and implementation of cost-effective computer systems. It is the intent of this paper to identify and discuss several issues applicable to high-level language computer architecture, to provide a more concrete definition of high-level language computers, and to suggest a direction for high-level language computer architectures of the future.

Esoteric: Aesthetics or no stated advantages.

An almost universal justification for high-level language computers is the view that

"the prime motivation for developing such a machine is to reduce system costs, for while hardware logic is becoming much cheaper, software is consuming a greater proportion of total system costs. A tremendous savings can be obtained by designing computer hardware that is oriented to aiding the programmer rather than to simplifying the computer designer's job."<sup>22</sup>

The solution to the software problem has appeared to be an increased use of "inexpensive" hardware. According to this viewpoint, the way to use this extra hardware is to raise the level of the machine language, so that in most cases there exists a one-to-one mapping between the

A radical computer architecture implementing a programming language and a timeshared operating system directly in hardware, Symbol remains a valuable lesson in building complex systems.





David R. Ditzel Bell Laboratories

One of the most radical computer architectures of the last decade was the Symbol<sup>1,2</sup> computer system, unveiled in 1971. The primary goal of the Symbol research project was to demonstrate with a full-scale working computer that a procedural general-purpose programming language and a large portion of a timeshared operating system could be implemented directly in hardware, resulting in a marked improvement in computational rates.<sup>3</sup> Another goal was to show that such a task could be carried out by a relatively small group of people in a reasonable amount of time by using appropriate design tools and construction techniques. Some features commonly provided by software were implemented directly in Symbol's hardware with sequential logic networks. These included

- Hardware compilation,
- Text editing,
- Timesharing supervision,
- Virtual memory management,
- Dynamic memory allocation,
- Dynamic memory reclamation,

however, the reader should be reminded that Symbol was intended to be a learning device rather than a commercially viable product.

Historical background. As early as 1964, a group of engineers at Fairchild's research facility in Palo Alto, California, decided that the future of VLSI technology dictated the use of hardware for traditional software functions. They also believed that existing programming languages had been influenced too heavily by the underlying hardware and that valuable programmer time was unnecessarily being spent performing functions such as memory management because of unreasonable computer architectures. A high-level language computer was seen as a way to reduce rising software costs.

Though Symbol was an experimental machine, the project was taken seriously. From the beginning there was a strong commitment to build a real and functional system. Considerable effort was spent on technology, packaging and computer-aided design tools. A new high

Our "Pre-RISC" paper: call for change, ISCA, May 1980 (written in 1979)

Recognition that SYMBOL/HLLCA was not the right direction: IEEE Computer magazine, 1981

## **Reduced Instruction Set Computing**

#### 1980: The Case for RISC Published

#### The Case for the Reduced Instruction Set Computer

David A. Patterson

Computer Science Division University of California Berkeley, California 94720

David R. Ditzel

Bell Laboratories Computing Science Research Center Murray Hill, New Jersey 07974

#### INTRODUCTION

One of the primary goals of computer architects is to design computers that are more cost-effective than their predecessors. Cost-effectiveness includes the cost of hardware to manufacture the machine, the cost of programming, and costs incurred related to the architecture in debugging both the initial hardware and subsequent programs. If we review the history of computer families we find that the most common architectural change is the trend toward ever more complex machines. Presumably this additional complexity has a positive tradeoff with regard to the cost-effectiveness of newer models. In this paper we propose that this trend is not always cost-effective, and in fact, may even do more harm than good. We shall examine the case for a Reduced Instruction Set Computer (RISC) being as cost-effective as a Complex Instruction Set Computer (CISC). This paper will argue that the next generation of VLSI computers may be more effectively implemented as RISC's than CISC's.

#### WORK ON RISC ARCHITECTURES

At Berkeley. Investigation of a RISC architecture has gone on for several months now under the supervision of D.A. Patterson and C.H. Séquin. By a judicious choice of the proper instruction set and the design of a corresponding architecture, we feel that it should be possible to have a very simple instruction set that can be very fast. This may lead to a substantial net gain in overall program execution speed. This is the concept of the Reduced Instruction Set Computer. The implementations of RISC's will almost certainly be less costly than the implementations of CISC's. If we can show that simple architectures are just as effective to the high-level language programmer as CISC's such as VAX or the IBM S/38, we can claim to have made an effective design.

At Bell Labs. A project to design computers based upon measurements of the C programming language has been under investigation by a small number of individuals at Bell Laboratories Computing Science Research Center for a number of years. A prototype 16-bit machine was designed and constructed by A.G. Fraser. 32-bit architectures have been investigated by S.R. Bourne, D.R. Ditzel, and S.C. Johnson. Johnson used an iterative technique of proposing a machine, writing a compiler, measuring the results to propose a better machine, and then repeating the cycle over a dozen times. Though the initial intent was not specifically to come up with a simple design, the result was a RISC-like 32-bit architecture whose code density was as compact as the PDP-11 and VAX [Johnson79].

At IBM. Undoubtedly the best example of RISC is the 801 minicomputer, developed by IBM Research in Yorktown Heights, N.Y. [Electronics76] [Datamation79]. This project is several years old and has had a large design team exploring the use of a RISC architecture in combination with very advanced compiler technology. Though many details are lacking their early results seem quite extraordinary. They are able to benchmark programs in a subset of PL/I that runs about five times the performance of an IBM S/370 model 168. We are certainly looking forward to more detailed information.

### Why RISC

Trend to "reduce the semantic gap" was the wrong approach.

Turns out a few really simple instructions are a good match for compilers

It's also much better for the hardware

A simple regular RISC instruction set has simpler control and datapaths and:

- Facilitates efficient instruction pipelining
- Leads to higher operating speed and performance
- Leaves more room for larger caches, for higher performance
- Logic that is "left out" doesn't have bugs, hence fewer bugs (more recently, Musk: "the best part is no part")

But in the early 1980's RISC vs CISC was not so clear.

- Led to many lively conference debates

We had to build real RISC chips to prove the benefits.

#### Early RISC Principles

**RISC: Reduced Instruction Set Computer** 

- RISC and CISC terms coined by UC Berkeley professors Dave Patterson and Carlo Séquin

Simple orthogonal instructions

- Often fixed 32-bit length
- Easy for compiler code generation
- Easy to implement with few (10-20) gate delays per cycle
- Easily pipelined

General purpose register file

- Enough registers to keep data for the current subroutine and one subroutine it calls
- Variation in register file style: Flat, Windowed, or invisible (stack cache)

Single load or store per instruction

Register to register arithmetic operations
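As a rough illustration of these principles, here is a toy load/store machine running `c = a + b`. The mnemonics (`LOAD`, `STORE`, `ADD`), the register count, and the use of a dictionary as memory are all assumptions made for this sketch, not any specific early RISC design. The point is that only loads and stores touch memory, and arithmetic works on registers alone.

```python
def run(program, memory, num_regs=8):
    """Execute a toy load/store program and return the final register file."""
    regs = [0] * num_regs
    for op, *args in program:
        if op == "LOAD":        # LOAD rd, addr: the only instruction that reads memory
            rd, addr = args
            regs[rd] = memory[addr]
        elif op == "STORE":     # STORE rs, addr: the only instruction that writes memory
            rs, addr = args
            memory[addr] = regs[rs]
        elif op == "ADD":       # ADD rd, rs1, rs2: operands and result are registers
            rd, rs1, rs2 = args
            regs[rd] = regs[rs1] + regs[rs2]
        else:
            raise ValueError(f"unknown op {op!r}")
    return regs


memory = {"a": 2, "b": 40, "c": 0}
program = [
    ("LOAD", 1, "a"),
    ("LOAD", 2, "b"),
    ("ADD", 3, 1, 2),
    ("STORE", 3, "c"),
]
final_regs = run(program, memory)
print(memory["c"])
```

Each instruction does one simple thing, which is what makes a fixed-length encoding and a regular pipeline practical.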

#### IBM 801: In my view, the first true RISC

#### IBM Introduces the 801 Minicomputer, the First Computer Employing RISC

1974



Image Source: www.ibm.com

John Cocke with the computer incorporating RISC architecture that he invented.

In 1974 IBM built the first prototype computer employing RISC (Reduced Instruction Set Computer)<sup>[2]</sup> architecture. Based on an invention by IBM researcher John Cocke<sup>[2]</sup>, the RISC concept simplified the instructions given to run computers, making them faster and more powerful. It was implemented in the experimental IBM 801<sup>[2]</sup> minicomputer. The goal of the 801 was to execute one instruction per cycle.

In 1987 John Cocke received the A. M. Turing Award for significant contributions in the design and theory of compilers, the architecture of large systems and the development of reduced instruction set computers (RISC); for discovering and systematizing many fundamental transformations now used in optimizing compilers including reduction of operator strength, elimination of common subexpressions, register allocation, constant propagation, and dead code elimination.



Photo of IBM 801 minicomputer.

The name 801 was from the IBM building number of the T.J. Watson Research Center in Yorktown Heights

First described at ASPLOS-1 in 1982

#### UC Berkeley RISC-I ~1981



Most of die consumed with register windows

More registers to keep more operands on die

Single load or store to/from a register

3-address register to register operations

Integer only

No caches

But simple enough for students to design!

**Register Windows** 
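A minimal sketch of how overlapping register windows can map logical register numbers onto a larger physical register file. The split used here (10 global, 6 overlapping, 10 local registers) and the convention that a call decrements the window index are assumptions for illustration. Window overflow and spilling to memory are not modeled.

```python
N_GLOBAL, N_OVERLAP, N_LOCAL = 10, 6, 10  # assumed split, for illustration only
STRIDE = N_OVERLAP + N_LOCAL              # physical registers per window step


def physical_index(window, reg):
    """Map a logical register number in a given window to a physical register.

    Assumed logical layout per window:
      0..9    globals, shared by every window
      10..15  outgoing arguments (shared with the callee's incoming)
      16..25  locals, private to this window
      26..31  incoming arguments (shared with the caller's outgoing)
    """
    if not 0 <= reg < N_GLOBAL + 2 * N_OVERLAP + N_LOCAL:
        raise ValueError("no such logical register")
    if reg < N_GLOBAL:
        return reg
    # A call decrements the window index, so the callee's incoming
    # registers land on the same physical registers as the caller's outgoing.
    return N_GLOBAL + window * STRIDE + (reg - N_GLOBAL)


caller, callee = 3, 2
print(physical_index(caller, 10), physical_index(callee, 26))
```

Because the caller's outgoing registers and the callee's incoming registers are the same physical storage, arguments are passed without any load or store.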

### Stanford MIPS ~1983



The MIPS instruction set consists of about 111 total instructions

Each instruction encoded in 32 bits

Pipelined to execute one instruction per clock

The instruction set includes:

- 21 arithmetic instructions (+, -, \*, /, %)
- 8 logic instructions (&, |, ~)
- 8 bit manipulation instructions
- 12 comparison instructions (>, <, =, >=, <=, ¬)
- 25 branch/jump instructions
- 15 load instructions
- 10 store instructions
- 8 move instructions
- 4 miscellaneous instructions

Delayed branch with 2 delay slots
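The effect of delay slots can be sketched with a toy simulator. The instruction names (`nop`, `jump`, `halt`) and the simulator itself are hypothetical, and jumps placed inside delay slots are not modeled. A jump only redirects the program counter after `delay_slots` further instructions have executed, so those slots always run.

```python
def run_delayed(program, delay_slots=2, max_steps=100):
    """Run a toy program and return the list of executed instruction addresses."""
    pc, trace = 0, []
    pending = None  # (slots_left, target) for a jump still in flight
    for _ in range(max_steps):
        if pc >= len(program):
            break
        op, *args = program[pc]
        trace.append(pc)
        next_pc = pc + 1
        if pending is not None:
            slots_left, target = pending
            slots_left -= 1
            if slots_left == 0:          # last delay slot done: jump takes effect
                next_pc, pending = target, None
            else:
                pending = (slots_left, target)
        if op == "jump":
            if delay_slots == 0:         # conventional, non-delayed jump
                next_pc = args[0]
            else:
                pending = (delay_slots, args[0])
        elif op == "halt":
            break
        pc = next_pc
    return trace


program = [
    ("nop",),      # 0
    ("jump", 5),   # 1
    ("nop",),      # 2: delay slot 1, always executed
    ("nop",),      # 3: delay slot 2, always executed
    ("nop",),      # 4: skipped by the jump
    ("halt",),     # 5
]
print(run_delayed(program))
print(run_delayed(program, delay_slots=0))
```

A compiler that can move useful instructions into addresses 2 and 3 hides the pipeline latency of the jump; otherwise it has to fill them with no-ops.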

# Early RISC Processors 1978-87: Bell Labs C Machines, including CRISP and Hobbit



### **AT&T CRISP Microprocessor**

- C-Language Reduced Instruction Set Processor
- 1.75 micron CMOS
- Announced 1987, 1<sup>st</sup> CMOS superscalar (more than 1 instr/clock) chip
- Hardware translated compact instr into 180-bit wide decoded micro-Op cache
- Branch folding eliminated the overhead of branches
- "Stack Cache" performed automatic register allocation in hardware
- Low power chip, used in EO Tablet computer
- Software binary translator used to move software from CISC WE32100

Lessons

- External to internal translation worked well
- Decouple external and internal ISA



#### AT&T Bell Labs CRISP: C language Reduced Instruction Set Processor



#### Transition to Bell Labs for 10 years: A more serious look



David R. Ditzel is a member of the technical staff at Bell Laboratories' Computing Science Research Center in Murray Hill, New Jersey. His current research activities include computer architecture, instruction set analysis, VLSI, computer-aided design tools, and personal computing systems. He has a BS in electrical engineering and a BS in computer science from Iowa State University, where he participated in the Symbol project for four years. In 1979, he received an MS in computer science from the University of California, Berkeley. Ditzel is a member of Tau Beta Pi, Eta Kappa Nu, Phi Beta Kappa, Phi Kappa Phi, ACM, and the IEEE.

#### Instead of putting our initials onto the chip.....



#### CRISP: Wins and losses in early tablets



Did NOT make it into Apple Newton .... AT&T fumbled the opportunity and caused Apple to fund ARM

see Wikipedia "AT&T Hobbit" for the story

Lessons in Technical vs Market Success in Hot Chips Presentation (from 2008)



# Transmeta 1995-2007:

First to provide full x86 compatibility using software binary translation (Code Morphing) to a simpler processor.

Many low power tricks

We proved that doing low power right can be very exciting

#### Slide that crystallized the concern over power growth

#### Power Density The Fundamental Problem



### Transmeta low-power x86 Compatible CPU

#### Efficeon is the sum of

x86 Code Morphing Software

#### **Code Morphing Software**

- Provides Compatibility
- Translates the 1's and 0's of x86 instructions to equivalent 1's and 0's for a simple VLIW processor
- Learns and improves with time

#### **VLIW Hardware**

- Very Long Instruction Word processor
- Simple and fast
- Fewer transistors
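The general translate-and-cache idea behind software binary translation can be sketched as follows. This is a toy under stated assumptions, not Transmeta's Code Morphing Software: the guest operations (`add`, `mul`), the class name, the `HOT_THRESHOLD` value, and the re-optimization rule are all hypothetical. A guest block is translated once, reused from a cache, and re-translated with more optimization once it proves hot.

```python
class ToyTranslator:
    """Translate once, run many times, re-optimize blocks that turn out hot."""

    HOT_THRESHOLD = 3  # assumed: re-optimize a block on its third execution

    def __init__(self, guest_blocks):
        self.guest_blocks = guest_blocks  # addr -> list of ("add", n) or ("mul", n)
        self.cache = {}                   # addr -> translated host function
        self.run_counts = {}
        self.translations = 0

    @staticmethod
    def _merge_adds(ops):
        """Toy optimization: fold consecutive adds into a single add."""
        merged = []
        for op, n in ops:
            if op == "add" and merged and merged[-1][0] == "add":
                merged[-1] = ("add", merged[-1][1] + n)
            else:
                merged.append((op, n))
        return merged

    def _translate(self, addr, optimize):
        ops = list(self.guest_blocks[addr])
        if optimize:
            ops = self._merge_adds(ops)
        self.translations += 1

        def host_code(acc):
            for op, n in ops:
                acc = acc + n if op == "add" else acc * n
            return acc

        host_code.length = len(ops)  # kept only so the effect can be inspected
        return host_code

    def run_block(self, addr, acc):
        count = self.run_counts.get(addr, 0) + 1
        self.run_counts[addr] = count
        if addr not in self.cache:
            self.cache[addr] = self._translate(addr, optimize=False)
        elif count == self.HOT_THRESHOLD:
            self.cache[addr] = self._translate(addr, optimize=True)
        return self.cache[addr](acc)


translator = ToyTranslator({0x100: [("add", 1), ("add", 2), ("mul", 3)]})
results = [translator.run_block(0x100, 1) for _ in range(4)]
print(results, translator.translations, translator.cache[0x100].length)
```

The results never change, only the cost of producing them: the block is translated twice in total, and the hot version is shorter than the first translation.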

Low Power

x86 PC Compatibility

High Performance



## Bill Gates at Comdex 2000 Keynote announced the Tablet would be the Future of Computing, and held up a Transmeta Crusoe based prototype.

Microsoft's Tablet PC technology enables any Windows-based application to take advantage of pen-based input.

With software developed by Microsoft, the Tablet PC can function as a sheet of paper.

Handwriting is captured as rich digital ink for immediate or later manipulation, including reformatting and editing.

The Tablet PC requires x86 compatibility as it needs to run Windows XP.

This Tablet PC prototype developed by Microsoft demonstrates the concept of tablet computing.



#### People got really excited about low power Transmeta processors!!













Compaq Tablet PC

#### Linus Torvalds led the Transmeta Code Morphing Software team



# Intel 2008-2013

### This slide intentionally left blank

# **RISC-V**

### A free and open instruction set

## Which instruction set?

## **RISC-V** is a better choice



Neither ARM nor x86 is very attractive

- Proprietary and expensive
- Not very energy efficient



New free and open CPU instruction set

- Managed by non-profit RISC-V International
- Over 3000 members
- Simpler ISA, hence more energy efficient
- Allows free or proprietary implementations
- Already competitive in area and performance
- Room for improvement
- Growing ecosystem, like early days of Linux

### RISC-V Instructions Described on 2 pages; 64-bit ARM Document is ~5000 pages

|           | Free                            | 8     | Op            | en 、             |        | RIS          | 50                         | - 1      | 🗸 R               | efere          | ence              | Ca      | rd               | 1            |
|-----------|---------------------------------|-------|---------------|------------------|--------|--------------|----------------------------|----------|-------------------|----------------|-------------------|---------|------------------|--------------|
| Ra        | se Integer                      |       |               |                  | _      |              |                            | -        | Ĕ                 | RV Priv        |                   |         |                  |              |
| Category  |                                 |       |               | RV32I Ba         |        |              | RV{64,128}                 |          | Catego            | ~              | Name              | 1150    | V mnen           | nonic        |
| Loads     | Load Byte                       | I     | LB            | rd,rs1           |        |              |                            |          | CSR Ac            |                | mic R/W           | CSRRW   | rd,os            |              |
|           | oad Halfword                    | I     | LH            | rd,rs1           |        |              |                            |          |                   | tomic Read     | & Set Bit         | CSRRS   | rd,os            |              |
|           | Load Word                       | I     | LW            | rd,rs1           | imm    | L{D 2}       | rd,rs1,                    | imm      | Ato               | mic Read &     | Clear Bit         | CSRRC   | rd,os            | r,rsl        |
|           | Byte Unsigned                   | I     | LBU           | rd,rs1           | imm    |              |                            |          |                   |                | X/W Imm           |         |                  |              |
|           | Half Unsigned                   | I     | LHU           | rd,rs1           | imm    | L{W D}U      | rd,rs1,                    | imm      | Atomic            | : Read & Set   | : Bit Imm         | CSRRS   | t rd,os          | r,imm        |
| Stores    | Store Byte                      | S     | SB            | rs1,rs           |        |              |                            |          |                   | Read & Clear   |                   |         | t rd,os          | r,imm        |
| s         | tore Halfword                   | S     | SH            | rs1,rs           |        |              |                            |          | Change            |                | Env. Call         |         |                  |              |
|           | Store Word                      | S     | SW            | rs1,rs           |        | 8{D Q}       | rs1,rs2                    |          | Env               | ironment Br    |                   | EBREAD  | ĸ                |              |
| Shifts    | Shift Left                      | R     | SLL           | rd,rs1           |        | SLL{W D]     |                            |          |                   | Environme      |                   |         |                  |              |
| Shift L   | eft Immediate                   | I     | SLLI          | rd,rs1           |        |              | D} rd,rs1,                 |          |                   | edirect to S   |                   |         |                  |              |
|           | Shift Right                     | R     | SRL           | rd,rs1           |        |              | } rd,rs1,:                 |          |                   | ct Trap to H   |                   |         |                  |              |
|           | ht Immediate                    | I     | SRLI          | rd,rs1           |        |              | D} rd,rs1,                 |          |                   | or Trap to S   |                   |         |                  |              |
|           | ght Arithmetic<br>ght Arith Imm | RI    | SRA<br>SRAI   | rd,rs1           |        |              | <pre>} rd,rs1,:</pre>      |          | MMU               | pt Wait for    |                   |         |                  |              |
| Arithme   |                                 | R     | ADD           | rd,rs1<br>rd,rs1 |        |              | D} rd,rs1,:<br>} rd,rs1,:  |          | mmu               | Supervis       | OF PENCE          | OF ENCI | vn rs            | •            |
|           | DD Immediate                    | I     | ADD           | rd,rs1<br>rd,rs1 |        |              | } rd,rs1,:<br>D} rd,rs1,:  |          |                   |                |                   |         |                  |              |
| -         | SUBtract                        | R     | SUB           | rd, rs1          |        |              | } rd,rs1,:                 |          |                   |                |                   |         |                  |              |
| 1.00      | ad Upper Imm                    | ũ     | LUI           | rd,imm           |        |              | ional Com                  |          | cod (16           | i-hit) Inc     | tmustic           | n Evto  | ncior ·          | <b>PVC</b>   |
|           | er Imm to PC                    | ŭ     |               | rd,imm           |        | Categor      |                            | Fmt      | 520 [10           | RVC            | G GC(10)          |         | VI equiv         |              |
| Logical   | XOR                             | R     | XOR           | rd, rs1          | re2    | Loads        | Load Word                  | CL       | C.LW              | rd',rs1'       | imm               |         | ',rsl',          |              |
|           | OR Immediate                    | î     |               | rd,rs1           |        |              | oad Word SP                |          | C.LWSP            | rd,imm         | , <b>-</b>        |         | ,isi,            |              |
|           | OR                              | R     | OR            | rd,rs1           |        |              | Load Double                | CL       | C.LD              | rd',rs1'       |                   |         | ,rs1',           |              |
|           | OR Immediate                    | ĩ     | ORT           | rd, rs1          |        |              | d Double SP                |          | C.LDSP            | rd.imm         | , 1000            |         | ,isi,            |              |
|           | AND                             | Ř     | AND           | rd,rs1           |        |              | Load Quad                  |          | C.LOBP            | rd',rs1'       | imm               |         | ,rs1',           |              |
| | AND Immediate | I | ANDI | rd,rs1,imm | | | Load Quad SP | CI | C.LQSP | rd,imm | | LQ | rd,sp,imm*16 | |
| Compare | Set < | R | SLT | rd,rs1,rs2 | | Stores | Store Word | CS | C.SW | rs1',rs2',imm | | SW | rs1',rs2',imm*4 | |
| | Set < Immediate | I | SLTI | rd,rs1,imm | | | Store Word SP | CSS | C.SWSP | rs2,imm | | SW | rs2,sp,imm*4 | |
| | Set < Unsigned | R | SLTU | rd,rs1,rs2 | | | Store Double | CS | C.SD | rs1',rs2',imm | | SD | rs1',rs2',imm*8 | |
| | Set < Imm Unsigned | I | SLTIU | rd,rs1,imm | | | Store Double SP | CSS | C.SDSP | rs2,imm | | SD | rs2,sp,imm*8 | |
| Branches | Branch = | SB | BEQ | rs1,rs2,imm | | | Store Quad | CS | C.SQ | rs1',rs2',imm | | SQ | rs1',rs2',imm*16 | |
| | Branch ≠ | SB | BNE | rs1,rs2,imm | | | Store Quad SP | CSS | C.SQSP | rs2,imm | | SQ | rs2,sp,imm*16 | |
| | Branch < | SB | BLT | rs1,rs2,imm | | Arithmetic | ADD | | C.ADD | rd,rs1 | | ADD | rd,rd,rs1 | |
| | Branch ≥ | SB | BGE | rs1,rs2,imm | | | ADD Word | | C.ADDW | rd,rs1 | | ADDW | rd,rd,rs1 | |
| | Branch < Unsigned | SB | BLTU | rs1,rs2,imm | | | ADD Immediate | CI | C.ADDI | rd,imm | | ADDI | rd,rd,imm | |
| | Branch ≥ Unsigned | SB | BGEU | rs1,rs2,imm | | | ADD Word Imm | CI | C.ADDIW | rd,imm | | ADDIW | rd,rd,imm | |
| Jump & Link | J&L | UJ | JAL | rd,imm | | | ADD SP Imm * 16 | | C.ADDI16SP | x0,imm | | ADDI | sp,sp,imm*16 | |
| | Jump & Link Register | UJ | JALR | rd,rs1,imm | | | ADD SP Imm * 4 | | C.ADDI4SPN | rd',imm | | ADDI | rd',sp,imm*4 | |
| Synch | Synch thread | I | FENCE | | | | Load Immediate | CI | C.LI | rd,imm | | ADDI | rd,x0,imm | |
| | Synch Instr & Data | I | FENCE.I | | | | Load Upper Imm | CI | C.LUI | rd,imm | | LUI | rd,imm | |
| System | System CALL | I | SCALL | | | | MoVe | | C.MV | rd,rs1 | | ADD | rd,rs1,x0 | |
| | System BREAK | I | SBREAK | | | | SUB | | C.SUB | rd,rs1 | | SUB | rd,rd,rs1 | |
| Counters | ReaD CYCLE | I | RDCYCLE | rd | | Shifts | Shift Left Imm | CI | C.SLLI | rd,imm | | SLLI | rd,rd,imm | |
| | ReaD CYCLE upper Half | I | RDCYCLEH | rd | | Branches | Branch=0 | CB | C.BEQZ | rs1',imm | | BEQ | rs1',x0,imm | |
| | ReaD TIME | I | RDTIME | rd | | | Branch≠0 | CB | C.BNEZ | rs1',imm | | BNE | rs1',x0,imm | |
| | ReaD TIME upper Half | I | RDTIMEH | rd | | Jump | Jump | CJ | C.J | imm | | JAL | x0,imm | |
| | ReaD INSTR RETired | I | RDINSTRET | rd | | | Jump Register | CR | C.JR | rd,rs1 | | JALR | x0,rs1,0 | |
| | ReaD INSTR upper Half | I | RDINSTRETH | rd | | Jump & Link | J&L | CJ | C.JAL | imm | | JAL | ra,imm | |
| | | | | | | | Jump & Link Register | CR | C.JALR | rs1 | | JALR | ra,rs1,0 | |
| | | | | | | System | Env. BREAK | CI | C.EBREAK | | | EBREAK | | |
| **32-bit Instruction Formats** | | | | | | | | **16-bit (RVC) Instruction Formats** | | | | | | |
| Bits | 31:25 | 24:20 | 19:15 | 14:12 | 11:7 | 6:0 | | | | | | | | |
| R | funct7 | rs2 | rs1 | funct3 | rd | opcode | | CR | funct4 | rd/rs1 | rs2 | op | | |
| I | imm[11:0], bits 31:20 | | rs1 | funct3 | rd | opcode | | CI | funct3 | imm | rd/rs1 | imm | op | |
| S | imm[11:5] | rs2 | rs1 | funct3 | imm[4:0] | opcode | | CSS | funct3 | imm | rs2 | op | | |
| SB | imm[12], imm[10:5] | rs2 | rs1 | funct3 | imm[4:1], imm[11] | opcode | | CIW | funct3 | imm | rd' | op | | |
| U | imm[31:12], bits 31:12 | | | | rd | opcode | | CL | funct3 | imm | rs1' | imm | rd' | op |
| UJ | imm[20], imm[10:1], imm[11], imm[19:12], bits 31:12 | | | | rd | opcode | | CS | funct3 | imm | rs1' | imm | rs2' | op |
| | | | | | | | | CB | funct3 | offset | rs1' | offset | op | |
| | | | | | | | | CJ | funct3 | jump target | op | | | |

RISC-V Integer Base (RV32I/64I/128I), privileged, and optional compressed extension (RVC). Registers x1-x31 and the pc are 32 bits wide in RV32I, 64 in RV64I, and 128 in RV128I (x0=0). RV64I/128I add 10 instructions for the wider formats. The RVI base of <50 classic integer RISC instructions is required. Every 16-bit RVC instruction matches an existing 32-bit RVI instruction. See riscv.org.
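The claim that every 16-bit RVC instruction matches an existing 32-bit RVI instruction can be illustrated with a single case. The sketch below expands `C.ADDI rd,imm` into `ADDI rd,rd,imm`. It assumes the bit layout of the ratified RVC specification (the card shown here documents a 2015 draft, whose encodings may differ), and the function name `expand_c_addi` is made up for this example.

```python
def expand_c_addi(halfword):
    """Expand a 16-bit C.ADDI encoding into the 32-bit ADDI rd, rd, imm encoding.

    Assumed CI-format layout (ratified RVC): funct3[15:13] = 000,
    imm[5] = bit 12, rd/rs1 = bits 11:7, imm[4:0] = bits 6:2, op[1:0] = 01.
    """
    if (halfword & 0b11) != 0b01 or (halfword >> 13) != 0b000:
        raise ValueError("not a C.ADDI encoding")
    rd = (halfword >> 7) & 0x1F
    imm = ((halfword >> 2) & 0x1F) | (((halfword >> 12) & 1) << 5)
    if imm & 0x20:
        imm -= 0x40  # sign-extend the 6-bit immediate
    # I-type layout: imm[11:0] | rs1 | funct3 | rd | opcode (ADDI: funct3=000, opcode=0x13)
    return ((imm & 0xFFF) << 20) | (rd << 15) | (0b000 << 12) | (rd << 7) | 0x13


if __name__ == "__main__":
    # c.addi a0, 1  and  c.addi a0, -1  (a0 is x10)
    for half in (0x0505, 0x157D):
        print(f"{half:#06x} -> {expand_c_addi(half):#010x}")
```

A hardware decoder does the same thing with a handful of wires, which is why the compressed extension adds little decode cost.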

**Free & Open RISC-V Reference Card (riscv.org)**

**Optional Multiply-Divide Instruction Extension: RVM**

| Category | Name | Fmt | RV32M (Multiply-Divide) | +RV{64,128} |
|---|---|---|---|---|
| Multiply | MULtiply | R | MUL rd,rs1,rs2 | MUL{W\|D} rd,rs1,rs2 |
| | MULtiply upper Half | R | MULH rd,rs1,rs2 | |
| | MULtiply Half Sign/Uns | R | MULHSU rd,rs1,rs2 | |
| | MULtiply upper Half Uns | R | MULHU rd,rs1,rs2 | |
| Divide | DIVide | R | DIV rd,rs1,rs2 | DIV{W\|D} rd,rs1,rs2 |
| | DIVide Unsigned | R | DIVU rd,rs1,rs2 | |
| Remainder | REMainder | R | REM rd,rs1,rs2 | REM{W\|D} rd,rs1,rs2 |
| | REMainder Unsigned | R | REMU rd,rs1,rs2 | REMU{W\|D} rd,rs1,rs2 |

**Optional Atomic Instruction Extension: RVA**

| Category | Name | Fmt | RV32A (Atomic) | +RV{64,128} |
|---|---|---|---|---|
| Load | Load Reserved | R | LR.W rd,rs1 | LR.{D\|Q} rd,rs1 |
| Store | Store Conditional | R | SC.W rd,rs1,rs2 | SC.{D\|Q} rd,rs1,rs2 |
| Swap | SWAP | R | AMOSWAP.W rd,rs1,rs2 | AMOSWAP.{D\|Q} rd,rs1,rs2 |
| Add | ADD | R | AMOADD.W rd,rs1,rs2 | AMOADD.{D\|Q} rd,rs1,rs2 |
| Logical | XOR | R | AMOXOR.W rd,rs1,rs2 | AMOXOR.{D\|Q} rd,rs1,rs2 |
| | AND | R | AMOAND.W rd,rs1,rs2 | AMOAND.{D\|Q} rd,rs1,rs2 |
| | OR | R | AMOOR.W rd,rs1,rs2 | AMOOR.{D\|Q} rd,rs1,rs2 |
| Min/Max | MINimum | R | AMOMIN.W rd,rs1,rs2 | AMOMIN.{D\|Q} rd,rs1,rs2 |
| | MAXimum | R | AMOMAX.W rd,rs1,rs2 | AMOMAX.{D\|Q} rd,rs1,rs2 |
| | MINimum Unsigned | R | AMOMINU.W rd,rs1,rs2 | AMOMINU.{D\|Q} rd,rs1,rs2 |
| | MAXimum Unsigned | R | AMOMAXU.W rd,rs1,rs2 | AMOMAXU.{D\|Q} rd,rs1,rs2 |

**Three Optional Floating-Point Instruction Extensions: RVF, RVD, & RVQ**

| Category | Name | Fmt | RV32{F\|D\|Q} (HP/SP, DP, QP Fl Pt) | +RV{64,128} |
|---|---|---|---|---|
| Move | Move from Integer | R | FMV.{H\|S}.X rd,rs1 | FMV.{D\|Q}.X rd,rs1 |
| | Move to Integer | R | FMV.X.{H\|S} rd,rs1 | FMV.X.{D\|Q} rd,rs1 |
| Convert | Convert from Int | R | FCVT.{H\|S\|D\|Q}.W rd,rs1 | FCVT.{H\|S\|D\|Q}.{L\|T} rd,rs1 |
| | Convert from Int Unsigned | R | FCVT.{H\|S\|D\|Q}.WU rd,rs1 | FCVT.{H\|S\|D\|Q}.{L\|T}U rd,rs1 |
| | Convert to Int | R | FCVT.W.{H\|S\|D\|Q} rd,rs1 | FCVT.{L\|T}.{H\|S\|D\|Q} rd,rs1 |
| | Convert to Int Unsigned | R | FCVT.WU.{H\|S\|D\|Q} rd,rs1 | FCVT.{L\|T}U.{H\|S\|D\|Q} rd,rs1 |
| Load | Load | I | FL{W,D,Q} rd,rs1,imm | |
| Store | Store | S | FS{W,D,Q} rs1,rs2,imm | |
| Arithmetic | ADD | R | FADD.{S\|D\|Q} rd,rs1,rs2 | |
| | SUBtract | R | FSUB.{S\|D\|Q} rd,rs1,rs2 | |
| | MULtiply | R | FMUL.{S\|D\|Q} rd,rs1,rs2 | |
| | DIVide | R | FDIV.{S\|D\|Q} rd,rs1,rs2 | |
| | SQuare RooT | R | FSQRT.{S\|D\|Q} rd,rs1 | |
| Mul-Add | Multiply-ADD | R | FMADD.{S\|D\|Q} rd,rs1,rs2,rs3 | |
| | Multiply-SUBtract | R | FMSUB.{S\|D\|Q} rd,rs1,rs2,rs3 | |
| | Negative Multiply-SUBtract | R | FNMSUB.{S\|D\|Q} rd,rs1,rs2,rs3 | |
| | Negative Multiply-ADD | R | FNMADD.{S\|D\|Q} rd,rs1,rs2,rs3 | |
| Sign Inject | SIGN source | R | FSGNJ.{S\|D\|Q} rd,rs1,rs2 | |
| | Negative SIGN source | R | FSGNJN.{S\|D\|Q} rd,rs1,rs2 | |
| | Xor SIGN source | R | FSGNJX.{S\|D\|Q} rd,rs1,rs2 | |
| Min/Max | MINimum | R | FMIN.{S\|D\|Q} rd,rs1,rs2 | |
| | MAXimum | R | FMAX.{S\|D\|Q} rd,rs1,rs2 | |
| Compare | Compare Float = | R | FEQ.{S\|D\|Q} rd,rs1,rs2 | |
| | Compare Float < | R | FLT.{S\|D\|Q} rd,rs1,rs2 | |
| | Compare Float ≤ | R | FLE.{S\|D\|Q} rd,rs1,rs2 | |
| Categorization | Classify Type | R | FCLASS.{S\|D\|Q} rd,rs1 | |
| Configuration | Read Status | R | FRCSR rd | |
| | Read Rounding Mode | R | FRRM rd | |
| | Read Flags | R | FRFLAGS rd | |
| | Swap Status Reg | R | FSCSR rd,rs1 | |
| | Swap Rounding Mode | R | FSRM rd,rs1 | |
| | Swap Flags | R | FSFLAGS rd,rs1 | |
| | Swap Rounding Mode Imm | I | FSRMI rd,imm | |
| | Swap Flags Imm | I | FSFLAGSI rd,imm | |

**Calling Convention**

| Register | ABI Name | Saver | Description |
|---|---|---|---|
| x0 | zero | --- | Hard-wired zero |
| x1 | ra | Caller | Return address |
| x2 | sp | Callee | Stack pointer |
| x3 | gp | --- | Global pointer |
| x4 | tp | --- | Thread pointer |
| x5-7 | t0-2 | Caller | Temporaries |
| x8 | s0/fp | Callee | Saved register/frame pointer |
| x9 | s1 | Callee | Saved register |
| x10-11 | a0-1 | Caller | Function arguments/return values |
| x12-17 | a2-7 | Caller | Function arguments |
| x18-27 | s2-11 | Callee | Saved registers |
| x28-31 | t3-t6 | Caller | Temporaries |
| f0-7 | ft0-7 | Caller | FP temporaries |
| f8-9 | fs0-1 | Callee | FP saved registers |
| f10-11 | fa0-1 | Caller | FP arguments/return values |
| f12-17 | fa2-7 | Caller | FP arguments |
| f18-27 | fs2-11 | Callee | FP saved registers |
| f28-31 | ft8-11 | Caller | FP temporaries |

RISC-V calling convention and five optional extensions: 10 multiply-divide instructions (RV32M); 11 optional atomic instructions (RV32A); and 25 floating-point instructions each for single-, double-, and quadruple-precision (RV32F, RV32D, RV32Q). The latter add registers f0-f31, whose width matches the widest precision, and a floating-point control and status register fcsr. Each larger address adds some instructions: 4 for RVM, 11 for RVA, and 6 each for RVF/D/Q. Using regex notation, {} means set, so L{D|Q} is both LD and LQ. See riscv.org. (8/21/15 revision)
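The card's set notation is compact but easy to misread, so here is a small sketch that expands it into the individual mnemonics. The helper name `expand_mnemonic` is invented for illustration and is not part of any RISC-V tool.

```python
import itertools
import re


def expand_mnemonic(pattern):
    """Expand the card's {A|B} set notation into every concrete mnemonic."""
    # re.split with a capture group alternates literal text (even indices)
    # and the contents of each {...} group (odd indices).
    parts = re.split(r"\{([^}]*)\}", pattern)
    choices = [p.split("|") if i % 2 else [p] for i, p in enumerate(parts)]
    return ["".join(combo) for combo in itertools.product(*choices)]


if __name__ == "__main__":
    print(expand_mnemonic("L{D|Q}"))
    print(expand_mnemonic("AMOADD.{D|Q}"))
    print(expand_mnemonic("FCVT.{S|D}.{L|T}U"))
```

Patterns with two sets, such as the FCVT rows, expand to the cross product of both sets.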

## **RISC-V** likely to flourish

RISC-V is likely to be the first highly successful open instruction set

Already lots of implementations, some open source, some not

Very successful already for cost-sensitive embedded applications

Expect high-performance, multi-issue, out-of-order versions for the high end: SiFive, Ventana, etc.

The open nature of the ISA is attractive to many

## Present day state of the art:

## Esperanto puts over a thousand RISC-V cores on a chip

## Esperanto's approach is different... and we think better for ML Recommendation

### Other ML Chip approaches

- One giant hot chip uses up the power budget
- Limited I/O pin budget limits memory BW
- Dependence on systolic array multipliers
  - Great for high ResNet50 score
  - Not so good with large sparse memory
- Only a **handful (10-20) of CPU cores**
  - Limited parallelism with CPU cores when problem doesn't fit onto array multiplier
- Standard voltage: not energy efficient

### Esperanto's better approach

[Figure: six low-power chips, each with its own DRAM and more than 1000 RISC-V/Tensor cores]

- Use **multiple low-power** chips that still fit within power budget
- Performance, pins, memory, bandwidth **scale up with more chips**
- **Thousands** of general-purpose RISC-V/tensor cores
  - Far more programmable than overly-specialized (e.g. systolic) hardware
  - Thousands of threads help with large sparse memory latency
- Full parallelism of thousands of cores always available
- Low-voltage operation of transistors is more energy-efficient
  - Lower voltage operation also reduces power
  - Requires both circuit and architecture innovations

Challenge: How to keep the power of each chip to < 20 watts?

Challenge was to put >1000 RISC-V Cores in a 20 Watt chip

Assumed half of 20W power for 1K RISC-V cores, so only 10 mW per core!

$$\text{Power (Watts)} = C_{dynamic} \times Voltage^2 \times Frequency + Leakage$$

|                                              | Power/core | Frequency | Voltage             | $C_{dynamic}$            |
|----------------------------------------------|------------|-----------|---------------------|--------------------------|
| Generic x86 server core (165 W for 24 cores) | 7 W        | 3 GHz     | 0.850 V             | 2.2 nF                   |
| 10 mW ET-Minion core (~10 W for 1K cores)    | 0.01 W     | 1 GHz     | 0.425 V             | 0.04 nF                  |
| Reductions needed to hit goals               | ~700x      | 3x        | 4x (V²)             | 58x                      |
| Difficulty                                   |            | Easy      | Hard (circuit/SRAM) | Very hard (architecture) |
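As a rough check on how the three reductions combine, the sketch below plugs the per-core figures from the table into the dynamic term of the power equation. It is only an illustration: the function and variable names are made up here, and leakage is left out because the slide gives no separate leakage figure.

```python
def dynamic_power_w(c_dynamic_f, voltage_v, frequency_hz):
    """Dynamic term of the slide's equation: C_dynamic * V^2 * f (leakage excluded)."""
    return c_dynamic_f * voltage_v ** 2 * frequency_hz


# Per-core figures copied from the table (capacitance in farads, frequency in Hz).
X86 = {"c": 2.2e-9, "v": 0.850, "f": 3e9, "budget_w": 7.0}
MINION = {"c": 0.04e-9, "v": 0.425, "f": 1e9, "budget_w": 0.010}

x86_dynamic = dynamic_power_w(X86["c"], X86["v"], X86["f"])
minion_dynamic = dynamic_power_w(MINION["c"], MINION["v"], MINION["f"])

frequency_gain = X86["f"] / MINION["f"]        # 3x
voltage_gain = (X86["v"] / MINION["v"]) ** 2   # 2x lower voltage gives 4x, since power scales with V^2
cdyn_gain = X86["c"] / MINION["c"]             # ratio of the rounded capacitance figures

if __name__ == "__main__":
    print(f"x86 dynamic power:    {x86_dynamic:.3f} W of a {X86['budget_w']} W budget")
    print(f"Minion dynamic power: {minion_dynamic * 1e3:.2f} mW of a {MINION['budget_w'] * 1e3:.0f} mW budget")
    print(f"gains: f {frequency_gain:.0f}x, V^2 {voltage_gain:.0f}x, C {cdyn_gain:.0f}x")
    print(f"combined dynamic reduction: {frequency_gain * voltage_gain * cdyn_gain:.0f}x")
```

With the rounded table values the capacitance ratio comes out at 55x rather than the slide's 58x, presumably because the slide used unrounded figures. In both rows the dynamic term lands at roughly 70% of the per-core budget, which would leave the rest for the leakage term.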

#### Study of energy-efficiency and number of chips to get best ML Performance in 120 watts (six 20W chips)



## Cluster of CPUs (Shires) to Become Next Unit of Compute

For Machine Learning or any highly parallel application wanting to use hundreds to thousands of CPUs, much better to consider the *basic unit of compute to be a cluster of CPUs and memory*, rather than just isolated individual processors.

Gain dramatic advantages by designing CPUs to work together on large problems: new "Parallel CPU"

- Increase performance
- Reduce area with smaller cores
- Reduce power by enabling simpler design
- Further reduce power (~4x) through cooperation between cores, both data and instruction work

Need to consider memory issues

- L1, L2, L3 caches and the path to main memory

Need to consider interconnect

- NoC-to-NoC interconnect between clusters

Esperanto has a highly optimized solution working in silicon today as a great new "unit of compute"

## 32 ET-Minion CPUs and 4 MB Memory form a "Minion Shire" cluster



#### 32 ET-MINION RISC-V CORES PER MINION SHIRE

Arranged in four 8-core neighborhoods

#### SOFTWARE CONFIGURABLE MEMORY HIERARCHY

- L1 data cache can also be configured as scratchpad
- Four 1MB SRAM banks can be partitioned as private L2, shared L3 and scratchpad

#### SHIRES CONNECTED WITH MESH NETWORK

#### **NEW SYNCHRONIZATION PRIMITIVES**

- Fast local atomics
- Fast local barriers
- Fast local credit counter
- IPI support

## 8 ET-Minions form a "Neighborhood"

#### **NEIGHBORHOOD CORES WORK CLOSELY TOGETHER**

- Architecture improvements capitalize on physical proximity of 8 cores
- Take advantage that almost always running highly parallel code

#### **OPTIMIZATIONS FROM CORES RUNNING THE SAME CODE**

- 8 ET-Minions share a single large instruction cache, which is more energy-efficient than many separate instruction caches
- "Cooperative loads" substantially reduce memory traffic to L2 cache

#### NEW INSTRUCTIONS MAKE COOPERATION MORE EFFICIENT

- New Tensor instructions dramatically cut back on instruction fetch bandwidth
- New instructions for fast local synchronization within group
- New Send-to-Neighbor instructions
- New Receive-from-Neighbor instructions



## ET-Minion is an Energy-Efficient RISC-V CPU with a Vector/Tensor Unit

#### **ET-MINION IS A CUSTOM BUILT 64-BIT RISC-V PROCESSOR**

- In-order pipeline with low gates/stage to improve MHz at low voltages
- Architecture and circuits optimized to enable low-voltage operation
- Two hardware threads of execution
- Software configurable L1 data-cache and/or scratchpad

#### **ML OPTIMIZED VECTOR/TENSOR UNIT**

- 512-bit wide integer per cycle
  - 128 8-bit integer operations per cycle, accumulates to 32-bit Int
- 256-bit wide floating point per cycle
  - 16 32-bit single precision operations per cycle
  - 32 16-bit half precision operations per cycle
- New multi-cycle Tensor Instructions
  - Can run for up to 512 cycles with one tensor instruction (32K ops)
  - Reduces instruction fetch bandwidth and reduces power
  - RISC-V integer pipeline put to sleep during tensor instructions
- Vector transcendental instructions

#### **OPERATING RANGE: 300 MHz TO 2 GHz**



ET-Minion RISC-V Core and Tensor/Vector unit optimized for low-voltage operation to improve energy-efficiency

Optimized for energy-efficient ML operations. Each ET-Minion can deliver peak of 128 Int8 GOPS per GHz
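The per-cycle figures on this slide can be cross-checked with a little arithmetic. The sketch below assumes that an "operation" counts a multiply and its accumulate separately (2 ops per lane per cycle); that reading is an inference from the numbers, not something the slide states. The chip-level line uses the 1088-core count from the ET-SoC-1 summary slide. All names are illustrative.

```python
def lanes(datapath_bits, element_bits):
    """Number of elements that fit across the datapath in one cycle."""
    return datapath_bits // element_bits


def peak_gops(ops_per_cycle, frequency_ghz, cores=1):
    """Peak throughput in GOPS for `cores` units clocked at `frequency_ghz`."""
    return ops_per_cycle * frequency_ghz * cores


int8_lanes = lanes(512, 8)     # 512-bit integer datapath
fp32_lanes = lanes(256, 32)    # 256-bit floating-point datapath
fp16_lanes = lanes(256, 16)

# Quoted ops/cycle divided by lane count: consistent with multiply + accumulate per lane.
ops_per_lane = {
    "int8": 128 // int8_lanes,
    "fp32": 16 // fp32_lanes,
    "fp16": 32 // fp16_lanes,
}

per_core_gops = peak_gops(128, 1.0)              # the slide's "128 Int8 GOPS per GHz"
chip_gops = peak_gops(128, 1.0, cores=1088)      # all ET-Minions at 1 GHz
tensor_lane_cycles = 512 * int8_lanes            # one 512-cycle tensor instruction

if __name__ == "__main__":
    print(int8_lanes, fp32_lanes, fp16_lanes, ops_per_lane)
    print(per_core_gops, chip_gops, tensor_lane_cycles)
```

Under that assumption a 512-cycle tensor instruction covers 512 x 64 = 32,768 lane-cycles, which lines up with the "32K ops" figure if that figure counts multiply-accumulates rather than separate multiplies and adds.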

### Shires are connected to each other and to external memory through Mesh Network



Dave Ditzel - MICRO-55 Keynote -- From One to a Million RISC Processors

## ET-SoC-1: Full chip internal block diagram



## ET-SoC-1 External Chip Interfaces



#### 8-lane PCIe Gen4

- Root/endpoint/both

#### 256-bit wide LPDDR4x

- 4267 MT/s
- 137 GB/s
- ECC support
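The 137 GB/s figure follows directly from the bus width and transfer rate. A minimal sketch of that arithmetic (the function name is illustrative, and this is peak theoretical bandwidth, not sustained):

```python
def peak_bandwidth_gb_s(bus_width_bits, mega_transfers_per_s):
    """Peak bandwidth in GB/s: bytes per transfer times transfers per second."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * mega_transfers_per_s * 1e6 / 1e9


lpddr4x_gb_s = peak_bandwidth_gb_s(256, 4267)

if __name__ == "__main__":
    print(f"{lpddr4x_gb_s:.1f} GB/s")
```

A 256-bit interface moves 32 bytes per transfer, so 4267 MT/s gives about 136.5 GB/s, which the slide rounds to 137 GB/s.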

#### **RISC-V SERVICE PROCESSOR**

- Secure Boot
- System Management
- Watchdog timers
- eFuse

#### EXTERNAL IO

- SMBus
- Serial I2C/SPI/UART
- GPIO
- FLASH

## Summary Statistics and Status of ET-SoC-1

#### ET-SoC-1 is fabricated in TSMC 7nm

- 24 billion transistors
- Die-area: 570 mm<sup>2</sup>
- 1088 ET-Minion energy-efficient 64-bit RISC-V processors
  - Each with an attached vector/tensor unit
  - Typical operation 300 MHz to 1 GHz
- 4 ET-Maxion 64-bit high-performance RISC-V out-of-order processors
  - Typical operation 500 MHz to 1.0 GHz
- Over 160 million bytes of on-die SRAM used for caches and scratchpad memory
- ET-SoC-1 Power ~ 20 watts, can be adjusted for 10 to 60+ watts under SW control
- Package: 45x45mm with 2494 balls to PCB, over 30,000 bumps to die
- Status
  - First silicon is healthy and has been shipped to paying customers
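These totals tie back to the Minion Shire organization described earlier (32 cores, four 8-core neighborhoods, and four 1MB SRAM banks per shire). The sketch below derives the shire count from those figures; it assumes 1 MB = 2^20 bytes, and the variable names are illustrative.

```python
MINION_CORES = 1088            # from the summary above
CORES_PER_SHIRE = 32           # from the Minion Shire slide
CORES_PER_NEIGHBORHOOD = 8
SRAM_MB_PER_SHIRE = 4          # four 1MB banks per shire

shires, leftover_cores = divmod(MINION_CORES, CORES_PER_SHIRE)
neighborhoods = MINION_CORES // CORES_PER_NEIGHBORHOOD
shire_sram_bytes = shires * SRAM_MB_PER_SHIRE * 2**20   # assumes 1 MB = 2**20 bytes

if __name__ == "__main__":
    print(f"{shires} Minion Shires, {neighborhoods} neighborhoods, {leftover_cores} cores left over")
    print(f"{shire_sram_bytes:,} bytes of shire L2/L3/scratchpad SRAM")
```

The shire banks account for about 143 million of the "over 160 million bytes" of on-die SRAM; the remainder presumably sits in L1 caches, instruction caches and other on-chip memories.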



ET-SoC-1 Die Plot



ET-SoC-1 Package

## Esperanto's ET-SoC-1 PCIe Evaluation Card



#### **Esperanto PCI Express Accelerator Card:**

- Single ET-SoC-1 chip running at 300 MHz to 1 GHz
- 1088 ET-Minion cores and 4 ET-Maxion cores
- 1056 ET-Minion cores provide acceleration
- 8 lanes of PCIe Gen 4
- ET-SoC-1 power can be configured from 10W to 60W per chip depending on customer requirements
- 16GB or 32GB DRAM
- Typically used as accelerator to x86 host
- Now being sold in servers for customer evaluations

### Esperanto Minion Shire Power with ML Recommendation Benchmarks



## Today: Over 300,000 RISC-V Vector Processors in a single rack



320 Esperanto PCIe cards in a standard 42U high rack using 20 server chassis, leaving 2U for TOR switch.
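The rack-level numbers follow from the per-card figures quoted earlier. The sketch below works through them; the per-chassis values are derived from the totals rather than stated on the slide, and the power line covers only the ET-SoC-1 chips at their nominal ~20 W.

```python
CARDS = 320
CHASSIS = 20
RACK_UNITS = 42
TOR_SWITCH_UNITS = 2
MINION_CORES_PER_CARD = 1088      # total ET-Minion cores per ET-SoC-1
ACCEL_CORES_PER_CARD = 1056       # cores that provide acceleration
NOMINAL_CHIP_WATTS = 20           # "~20 watts" from the summary slide

cards_per_chassis = CARDS // CHASSIS
units_per_chassis = (RACK_UNITS - TOR_SWITCH_UNITS) // CHASSIS
total_cores = CARDS * MINION_CORES_PER_CARD
accel_cores = CARDS * ACCEL_CORES_PER_CARD
chip_power_kw = CARDS * NOMINAL_CHIP_WATTS / 1000

if __name__ == "__main__":
    print(f"{cards_per_chassis} cards per {units_per_chassis}U chassis")
    print(f"{total_cores:,} ET-Minion cores, {accel_cores:,} of them for acceleration")
    print(f"{chip_power_kw} kW for the accelerator chips alone")
```

Either core count clears the "over 300,000" in the slide title.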

## What's next for RISC?

## Chiplets and a lot more cores

### Industry standard UCIe bus promoting chiplet interoperability announced March 2, 2022





Standardized Chiplet bus is a big deal.

Will accelerate Chiplet development.

Chiplets will affect your future, so get ready.

JOIN US!

# Esperanto's Next Generation will use a Chiplet-based implementation

Easy for Esperanto to make chiplet products, cut into pieces and shrink from 7nm to 3nm



I/O Chiplet

General Purpose CPU Chiplet



ET-Minion Core Massively Parallel Compute Chiplets

### Cost Benefit Using Chiplets Gets More Compelling Every Foundry Node

#### Example: Relative cost of ONE monolithic 500 mm<sup>2</sup> die vs implementing as FOUR 125mm<sup>2</sup> chiplets
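The slide's chart is not reproduced here, so the sketch below is not Esperanto's cost model. It is a textbook Poisson yield estimate showing why four small dies tend to cost less than one large one. The defect densities are arbitrary assumptions, and packaging, test and wafer-edge effects are ignored.

```python
import math


def poisson_yield(area_mm2, defects_per_mm2):
    """Fraction of good dies under a simple Poisson defect model."""
    return math.exp(-area_mm2 * defects_per_mm2)


def silicon_cost(area_mm2, n_dies, defects_per_mm2):
    """Relative silicon cost: area that must be fabricated per set of good dies."""
    return n_dies * area_mm2 / poisson_yield(area_mm2, defects_per_mm2)


def monolithic_vs_chiplets(defects_per_mm2):
    """Cost of one 500 mm^2 die relative to four 125 mm^2 chiplets."""
    return silicon_cost(500, 1, defects_per_mm2) / silicon_cost(125, 4, defects_per_mm2)


if __name__ == "__main__":
    # Assumed defect densities in defects per mm^2 (0.001 is 0.1 per cm^2).
    for d0 in (0.0005, 0.001, 0.002):
        print(f"D0 = {d0}: monolithic costs {monolithic_vs_chiplets(d0):.2f}x the chiplet set")
```

In this model the ratio reduces to exp(375 x D0), so the advantage grows with defect density, one plausible reason the benefit becomes more compelling at newer, less mature nodes.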



What might chiplets in a package look like in 2030? 4x today in a future node?



## Small

- 1 Host CPU chiplet
- 1 IO Chiplet
- 1 Compute Chiplet with:
  - 4K RISC-V+vector cores
  - 3D Memory

## Medium

- 2 Host CPU chiplets
- 2 IO Chiplets
- 4 Compute Chiplets with:
  - 16K RISC-V compute cores
  - 3D Memory


## Large

- 4 Host CPU chiplets
- 3 IO Chiplets
- 16 Compute Chiplets with:
- 64K RISC-V compute cores
- 3D Memory
- Optical IO

## Prediction: Over 1,000,000 cores in a small server by 2030

Today in 7nm we have over 1,000 cores per chip

Moore's law is slowing down, but not dead

By 2030 expect to have over 4,000 cores in a chiplet

Expect to put 16 chiplets together in a single package, i.e. 64,000 cores in a package

Put each package on a single accelerator board

Like we do today, put 16 of these boards in a small server

That is over one million RISC-V cores, each with a vector unit, ready to run your workload

What will you do with over a million cores at your disposal? This is the next generation challenge.

One million cores might fit on just a few cards. Challenge is using them efficiently.



Example: 1 million cores on 16 boards, each board with one package of 64K cores. If the boards fit in a server of similar size to today's, with 20 servers per rack, that's 20 million RISC-V cores in a rack.
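The scaling steps above can be written out as a short calculation. This is a rough sketch: reading "4K" as 4096 cores per chiplet is an assumption (the slides say "over 4,000"), and the 20 servers per rack is carried over from today's 20-chassis rack.

```python
# Projected 2030 figures from the slides; names are illustrative.
CORES_PER_CHIPLET = 4096    # assumption: "4K" read as 4096 ("over 4,000")
CHIPLETS_PER_PACKAGE = 16
PACKAGES_PER_BOARD = 1
BOARDS_PER_SERVER = 16
SERVERS_PER_RACK = 20       # same chassis count as today's rack

cores_per_package = CORES_PER_CHIPLET * CHIPLETS_PER_PACKAGE
cores_per_server = cores_per_package * PACKAGES_PER_BOARD * BOARDS_PER_SERVER
cores_per_rack = cores_per_server * SERVERS_PER_RACK

print(f"cores per package: {cores_per_package:,}")
print(f"cores per server:  {cores_per_server:,}")
print(f"cores per rack:    {cores_per_rack:,}")
```

With exactly 4,000 cores per chiplet the totals are 64,000 per package, 1,024,000 per server, and 20,480,000 per rack, so the "over one million" and "20 million" headlines hold either way.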

## Energy cost to move data is the key challenge

Today's ET-SoC-1 vector units can consume ~100 TB/s, and the caches feed them at 8 TB/s. If future chips are 4x as capable, how do we feed them 4x the bandwidth, i.e. 32 TB/s, as data set sizes grow?
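The power needed to sustain a given bandwidth is the energy per bit multiplied by the bits moved per second. The short sketch below (the function name is illustrative) reproduces that arithmetic for a few of the energy-per-bit values in the table that follows, assuming decimal terabytes and 8 bits per byte.

```python
def data_movement_power_watts(pj_per_bit: float, tb_per_sec: float) -> float:
    """Power in watts to move tb_per_sec terabytes/s at pj_per_bit picojoules/bit."""
    bits_per_sec = tb_per_sec * 1e12 * 8    # decimal TB, 8 bits per byte
    joules_per_bit = pj_per_bit * 1e-12
    return joules_per_bit * bits_per_sec

BANDWIDTH_TB_PER_SEC = 32
for pj in (40, 6.5, 1.0, 0.25):
    watts = data_movement_power_watts(pj, BANDWIDTH_TB_PER_SEC)
    print(f"{pj:>5} pJ/bit at {BANDWIDTH_TB_PER_SEC} TB/s -> {watts:,.1f} W")
```

Since the picojoule and terabyte scale factors cancel, the result simplifies to pJ/bit × TB/s × 8.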

| pJ/bit | TB/sec | Power (Watts) |
|--------|--------|---------------|
| 40       | 32                                                                                                    | 10240.0                                                                                                                                                                                             |
| 6.5      | 32                                                                                                    | 1664.0                                                                                                                                                                                              |
| 1.8      | 32                                                                                                    | 460.8                                                                                                                                                                                               |
| 1.5      | 32                                                                                                    | 384.0                                                                                                                                                                                               |
| 1.3      | 32                                                                                                    | 332.8                                                                                                                                                                                               |
| 1.0      | 32                                                                                                    | 256.0                                                                                                                                                                                               |
| 0.80     | 32                                                                                                    | 204.8                                                                                                                                                                                               |
| 0.56     | 32                                                                                                    | 143.4                                                                                                                                                                                               |
| 0.50     | 32                                                                                                    | 128.0                                                                                                                                                                                               |
| 0.30     | 32                                                                                                    | 76.8                                                                                                                                                                                                |
| 0.50     | 32                                                                                                    | 128.0                                                                                                                                                                                               |
| 0.25     | 32                                                                                                    | 64.0                                                                                                                                                                                                |
| 0.20     | 32                                                                                                    | 51.2                                                                                                                                                                                                |
| 0.10     | 32                                                                                                    | 25.6                                                                                                                                                                                                |
| 0.01     | 32                                                                                                    | 2.6                                                                                                                                                                                                 |

Data movement power may be far greater than compute power. Stacking memory in 3D on top of compute chiplets may be one way to reduce excessive chip-to-chip data movement power.
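The Power column in the table above appears to follow directly from the other two columns: energy per bit multiplied by bits moved per second. A minimal sketch of that arithmetic, assuming TB/sec means 10^12 bytes per second and 8 bits per byte (the function name `data_movement_watts` is illustrative, not from the talk):

```python
BITS_PER_BYTE = 8


def data_movement_watts(pj_per_bit: float, tb_per_sec: float) -> float:
    """Power in watts needed to move data at the given energy cost and bandwidth.

    Assumption: 1 TB/sec = 10**12 bytes/sec. The pico (10**-12) in pJ and the
    tera (10**12) in TB cancel, so the scale factors drop out of the product.
    """
    return pj_per_bit * tb_per_sec * BITS_PER_BYTE


if __name__ == "__main__":
    # A few pJ/bit values taken from the table, all at 32 TB/sec.
    for pj in (40, 6.5, 1.0, 0.56, 0.10, 0.01):
        print(f"{pj:>5} pJ/bit at 32 TB/sec -> {data_movement_watts(pj, 32):.1f} W")
```

Under these assumptions, every 1 pJ/bit costs 256 W at 32 TB/sec, so the table's range from 40 pJ/bit down to 0.01 pJ/bit spans roughly 10 kW down to a few watts.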

## Summary

RISC vs. CISC is no longer a popular debate: RISC has won

- RISC techniques are ubiquitous in most types of processors
- RISC-V is likely to be the common instruction set for much new RISC innovation
- RISC-V doesn't carry years of baggage, providing advantages in power, area, cost
- RISC-V started with simple embedded applications
- Will soon see 8+ issue RISC-V out-of-order cores rivaling the best of other ISA implementations
- For massively parallel compute, energy efficiency is key, and slower in-order cores with vector units win

Our future is likely to continue to use RISC techniques:

- Easy-to-program general-purpose RISC-V is likely to win over specialized ISAs and architectures
- Next big challenge likely to be methods to program millions of RISC-V cores efficiently
- Big opportunities in CPU micro-architecture to design more efficient clusters of RISC-V cores

# Thanks!

# End of Presentation

# Questions?