#### PICO: ASIC Synthesis from C

#### Rob Schreiber, Shail Aditya, Bob Rau, Vinod Kathail, Scott Mahlke, Darren Cronquist, Mukund Sivaraman

#### HP Labs, Palo Alto

Colore the ADD and Manual and Table 2002

#### Outline

- What Can PICO Do for an SOC Designer?
- The PICO System Design Hierarchy
- From Sequential to Parallel Loop Nest
- Parallel Loop Nest to Processor Design

#### **PICO overview**




### Using PICO

- User provides application, test data, and design space limits
- User indicates hot loop nests
- PICO creates a Pareto set of ASIP designs
- Each design has a customized VLIW with zero or more loop nests realized in HW
- User selects appropriate design for SOC based on area, power, performance tradeoff

#### PICO's ASIP Architecture




#### Hierarchical Design Frameworks


#### An Automated Design Template

Parameter Ranges Function Specification

SpaceWalker

Constructor

**Evaluator** 

Pareto Filter
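The Pareto Filter stage of this template can be illustrated with a minimal sketch. It assumes each evaluated design is summarized by two cost metrics where lower is better (area and execution time); the `Design` record and the function names are hypothetical and are not taken from PICO itself.

```c
#include <stddef.h>

/* Hypothetical design record: lower is better for both metrics. */
typedef struct {
    double area;   /* e.g. mm^2 */
    double time;   /* e.g. execution time */
} Design;

/* Returns 1 if a is at least as good as b on both metrics and
 * strictly better on at least one of them. */
static int dominates(const Design *a, const Design *b)
{
    return a->area <= b->area && a->time <= b->time &&
           (a->area < b->area || a->time < b->time);
}

/* Copies the non-dominated designs from in[0..n) to out, preserving
 * their order, and returns how many were kept.
 * out must have room for n entries. */
size_t pareto_filter(const Design *in, size_t n, Design *out)
{
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        int dominated = 0;
        for (size_t j = 0; j < n && !dominated; j++)
            if (j != i && dominates(&in[j], &in[i]))
                dominated = 1;
        if (!dominated)
            out[kept++] = in[i];
    }
    return kept;
}
```

The quadratic scan is adequate for a sketch; a real design space explorer would more likely filter incrementally as designs are evaluated.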

#### Good Systems from Good Subsystems



System Constructor

System Evaluator

System Pareto Filter


#### Design Space Exploration



#### PICO GUI

[Screenshot: the PICO main window, titled "PICO: A Compiler-Guided Processor Design Tool" (GUI adapted from TU Delft's MOVE system). It shows the application project file (`/car/scratchy/demo2000/pico/apps/jpeg99`), the VLSI model file (`/car/scratchy/demo2000/pico/Models/model25`), the design task ("All"), the Pareto solutions selector ("All Systems"), and an "Area vs. Performance" plot with area in mm^2 on the horizontal axis and "Area / Time" and "Area / Throughput" views.]

#### Limiting the Design Space

[Screenshot: the "EPIC-Only + Hybrid Design Space Explorer" window, in which the user sets the range of each design parameter. The legible settings are listed below; cells are left empty where the value did not survive extraction.]

| Group | Parameter | Low | High | Step |
|---|---|---|---|---|
| Functional units | Integer | 1 | 8 | 1 |
| Functional units | Float | | | 1 |
| Functional units | Memory | | 4 | |
| Functional units | Branch | 1 | 1 | 1 |
| Register files | Integer | 32 | 128 | 32 |
| Register files | Float | 32 | 32 | 16 |
| Register files | Predicate | 32 | 32 | 16 |
| Register files | Branch | 16 | 16 | |
| Systolic array | Systolic processors | 1 | 2 | |
| Systolic array | II | 1 | 8 | 1 |
| Systolic array | Memory ports | 1 | 1 | 1 |
| Level 1 data cache | Cache sets | 128 | 128 | |
| Level 1 data cache | Associativity | | 2 | |
| Level 1 data cache | Line size | 16 | 32 | |
| Level 1 instruction cache | Cache sets | 128 | 128 | |
| Level 1 instruction cache | Associativity | 2 | 2 | |
| Level 1 instruction cache | Line size | 32 | 128 | |
| Level 2 unified cache | Cache sets | 256 | 256 | |
| Level 2 unified cache | Associativity | 2 | 3 | |
| Level 2 unified cache | Line size | 64 | 128 | |

Other controls in the window: predication, speculation, EPIC type (heterogeneous), systolic (without or both), and search mode (show final Pareto or show intermediate Paretos).

#### Exploration




#### Pareto Optimal Machines: VLIW-only



#### Pareto Optimal Machines: All systems



#### Systolic Design: Exploration



Synthesis of a Non-Programmable, Application-Specific Accelerator:

From Sequential Loop Nest to Parallel Loop Nest


### Input Language

- A perfect loop nest  $\rightarrow$  A systolic array
- A sequence of nests  $\rightarrow$  A pipeline of arrays
- Constant loop bounds
- Dependence analysis must be feasible:
  - No aliasing through pointers
- Language extensions
  - #pragma bitsize x 12
  - #internal coeff
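As an illustration of these restrictions, here is a minimal sketch of a loop nest that fits them: a perfect nest, constant bounds, and array accesses without pointer aliasing. The FIR kernel, the array sizes, and the function name are assumptions chosen for the example. The two language extensions are shown as comments so that the code compiles with an ordinary C compiler; reading them as "x is 12 bits wide" and "coeff is held inside the accelerator" is an interpretation of the slide, not a documented definition.

```c
#define N_OUT  8   /* number of outputs (constant loop bound) */
#define N_TAPS 4   /* number of filter taps (constant loop bound) */

/* PICO language extensions from the slide, shown as comments:
 *   #pragma bitsize x 12
 *   #internal coeff
 */

/* File-scope arrays: every reference is a direct array access,
 * so dependence analysis does not have to reason about pointers. */
int x[N_OUT + N_TAPS - 1];
int coeff[N_TAPS];
int y[N_OUT];   /* must be zeroed before fir() is called */

/* A perfect loop nest: all work is in the innermost loop body,
 * and every subscript is an affine function of the loop indices. */
void fir(void)
{
    for (int i = 0; i < N_OUT; i++)
        for (int j = 0; j < N_TAPS; j++)
            y[i] = y[i] + coeff[j] * x[i + j];
}
```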

#### From C to VHDL

- Sequential C loop nest
- Sequential loop nest, tiled and register promoted
- Iteration scheduled, parallel loop nest
- Function units and software pipelined loop nest
- Registers, interconnect, FUs, memory
- Verilog/VHDL design


#### From C to VHDL



Tiles, schedules, maps, transforms loops, eliminates loads/stores

Optimizes, analyzes bitwidth, allocates function units, software pipelining

Allocates registers and interconnect. Builds VHDL description of processor.


# What does it take to make this efficient?


#### The Memory Wall



Memory

CPU


#### Cache and Local Memory



#### Goal of Code Transformation

```
for each TILE
  for (t = 0; t < Tfinal; t++)
    forall processors p
      X[t][p] = ... Y[t-1][p+1] ...
```

#### **Tiling the Iteration Space**



- Volume/Surface = $O(\text{radius})$
- Computation/Footprint = $\Omega(\text{radius})$
- Computation/Footprint = CPU/Memory

#### Load/Store Elimination

- For affine array references, intermediate results are kept in registers
- For affine, read-only array references, data is routed through registers; no value is loaded more than once
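The effect of routing read-only data through registers can be sketched on a small example. The 3-point sum below is a hypothetical kernel chosen for illustration; both versions compute the same result, and each returns the number of loads it issues so the saving is visible.

```c
#include <stddef.h>

/* Naive version: every output reloads its three inputs.
 * x must hold n + 2 elements. Returns the number of loads from x. */
size_t stencil_naive(const int *x, int *y, size_t n)
{
    size_t loads = 0;
    for (size_t i = 0; i < n; i++) {
        y[i] = x[i] + x[i + 1] + x[i + 2];
        loads += 3;
    }
    return loads;
}

/* Register-promoted version: values of x travel through r0, r1, r2,
 * so each element of x is loaded exactly once. */
size_t stencil_promoted(const int *x, int *y, size_t n)
{
    size_t loads = 0;
    if (n == 0)
        return 0;
    int r0 = x[0];
    int r1 = x[1];
    loads += 2;
    for (size_t i = 0; i < n; i++) {
        int r2 = x[i + 2];   /* the only load in the loop body */
        loads++;
        y[i] = r0 + r1 + r2;
        r0 = r1;             /* shift the register window */
        r1 = r2;
    }
    return loads;
}
```

For `n` outputs the naive loop issues `3 * n` loads and the promoted loop issues `n + 2`.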

#### **Tile Shapes**

- Big tiles $\rightarrow$ more local memory
- Small tiles $\rightarrow$ less reuse of data, more global memory bandwidth
- Optimal tile $\rightarrow$ smallest tile that does not oversubscribe memory bandwidth
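A minimal sketch of the "smallest tile that does not oversubscribe memory bandwidth" rule follows. It assumes a deliberately simple, hypothetical cost model: a square 2-D tile of side `s` performs one operation per point (volume `s * s`) and moves one word per boundary point (surface `4 * s`). Neither the model nor the function name comes from PICO.

```c
#include <assert.h>

/* Returns the smallest tile side s in [1, max_side] for which moving the
 * tile's boundary data takes no longer than computing the tile, or 0 if
 * no such side exists.
 *   ops_per_cycle   : peak compute rate (the "CPU" side of the balance)
 *   words_per_cycle : global memory bandwidth (the "Memory" side) */
int smallest_balanced_tile(int ops_per_cycle, int words_per_cycle, int max_side)
{
    assert(ops_per_cycle > 0 && words_per_cycle > 0);
    for (int s = 1; s <= max_side; s++) {
        long compute_ops   = (long)s * s;  /* volume of the tile */
        long traffic_words = 4L * s;       /* surface of the tile */
        /* Balanced when compute time >= transfer time:
         * compute_ops / ops_per_cycle >= traffic_words / words_per_cycle */
        if (compute_ops * words_per_cycle >= traffic_words * ops_per_cycle)
            return s;
    }
    return 0;
}
```

Under this model the balance point is `s >= 4 * ops_per_cycle / words_per_cycle`, which is the Computation/Footprint = CPU/Memory condition specialized to a square tile.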

#### Estimating the Footprint



Affine array reference `X[i+j][2*j-3*k]`

How many integer points in an affine image of a rectangular iteration space?
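For small iteration spaces the question can be answered by brute force, which is useful as a reference when checking an estimate. The sketch below counts the distinct elements touched by `X[i+j][2*j-3*k]` over the box `0 <= i < ni`, `0 <= j < nj`, `0 <= k < nk`. The function name and the `MAX_EXTENT` limit are assumptions made for the example.

```c
#include <assert.h>
#include <string.h>

#define MAX_EXTENT 16   /* largest supported loop extent (assumption) */

/* Counts the distinct array elements X[i+j][2*j-3*k] referenced by the
 * rectangular iteration space [0,ni) x [0,nj) x [0,nk). */
long footprint_bruteforce(int ni, int nj, int nk)
{
    /* row = i + j lies in [0, ni + nj - 2];
     * col = 2*j - 3*k lies in [-3*(nk-1), 2*(nj-1)], shifted to start at 0. */
    static unsigned char seen[2 * MAX_EXTENT][5 * MAX_EXTENT];
    assert(ni >= 1 && nj >= 1 && nk >= 1);
    assert(ni <= MAX_EXTENT && nj <= MAX_EXTENT && nk <= MAX_EXTENT);
    memset(seen, 0, sizeof seen);

    int col_offset = 3 * (nk - 1);
    long count = 0;
    for (int i = 0; i < ni; i++)
        for (int j = 0; j < nj; j++)
            for (int k = 0; k < nk; k++) {
                int row = i + j;
                int col = 2 * j - 3 * k + col_offset;
                if (!seen[row][col]) {
                    seen[row][col] = 1;
                    count++;
                }
            }
    return count;
}
```

On a 2 x 2 x 2 box all 8 iterations touch different elements. On a 4 x 4 x 3 box the 48 iterations touch 47 elements, because iterations (3, 0, 0) and (0, 3, 2) both reference `X[3][0]`.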

# Example: the Affine Image of an Iteration Space



#### **Corrected Estimates**

Published bounds on the size of the image of a Z-polytope are *wrong*.

Our corrections:

- footprint = size of the iteration space for one-to-one mappings
- a mapping is one-to-one if no two iterations in the iteration space differ by an integer null vector of the mapping
- corrected bounds come from counting the iterations that differ by a null vector
- the corrected estimates are within 20 percent in practice
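The null-vector idea can be sketched for the simplest case: a 3-deep rectangular iteration space and a mapping whose integer null space is generated by a single primitive vector `v`. Two iterations then touch the same element exactly when they differ by a multiple of `v`, so the footprint is the iteration count minus the number of iterations `x` for which `x + v` is also in the box. This is an illustration under those stated assumptions, not the authors' published bound; the function name is hypothetical. For `X[i+j][2*j-3*k]` the null vector is `v = (-3, 3, 2)`, since `-3 + 3 = 0` and `2*3 - 3*2 = 0`.

```c
#include <assert.h>
#include <stdlib.h>

/* Footprint of an affine reference over the box
 * [0,extent[0]) x [0,extent[1]) x [0,extent[2]),
 * assuming the integer null space of the index mapping is generated by
 * the single primitive vector v. */
long footprint_from_null_vector(const int extent[3], const int v[3])
{
    long iterations = 1;
    long colliding  = 1;   /* iterations x such that x + v is also in the box */
    for (int d = 0; d < 3; d++) {
        assert(extent[d] >= 1);
        int overlap = extent[d] - abs(v[d]);
        iterations *= extent[d];
        colliding  *= (overlap > 0) ? overlap : 0;
    }
    return iterations - colliding;
}
```

When `v` does not fit inside the box in some dimension, `colliding` is zero and the footprint equals the number of iterations, which is the one-to-one case.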

#### Reindexing to Reduce Local Memory



#### Finding the Parallel Iteration Schedule



- *Processors*: a mesh of processors is given
- *Initiation Interval (II)*: every processor starts an iteration periodically, with period equal to II (*hardware* pipelining)
- *Mapping*: clusters of iterations are mapped to each processor
- *Schedule*: one iteration per processor every II cycles
- *Honor* data dependence constraints
- Find the schedule via an efficient direct search method


#### Hardware/Software Pipelining

`for (i = 0; i < 100; i++) a[i] += b[i] * c[i];`



time

#### Lower Bounds on II (RecMII, ResMII)
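The two bounds can be written down using the standard definitions from modulo scheduling: the resource bound (ResMII) is the largest ratio of operations needing a resource to units of that resource, and the recurrence bound (RecMII) is the largest ratio of total delay to total iteration distance over the dependence cycles. The sketch below takes these quantities as precomputed arrays; the function names and the array-based interface are assumptions made for illustration.

```c
#include <assert.h>

/* ceil(a / b) for a >= 0, b > 0 */
static int ceil_div(int a, int b)
{
    return (a + b - 1) / b;
}

/* ResMII: uses[r] operations per iteration need resource r,
 * of which units[r] copies are available. */
int res_mii(const int *uses, const int *units, int n_resources)
{
    int mii = 1;
    for (int r = 0; r < n_resources; r++) {
        assert(units[r] > 0);
        int need = ceil_div(uses[r], units[r]);
        if (need > mii)
            mii = need;
    }
    return mii;
}

/* RecMII: dependence cycle c has total latency delay[c] and spans
 * distance[c] iterations. */
int rec_mii(const int *delay, const int *distance, int n_cycles)
{
    int mii = 1;
    for (int c = 0; c < n_cycles; c++) {
        assert(distance[c] > 0);
        int need = ceil_div(delay[c], distance[c]);
        if (need > mii)
            mii = need;
    }
    return mii;
}

/* The II of any valid schedule is at least the larger of the two bounds. */
int min_ii(int res, int rec)
{
    return res > rec ? res : rec;
}
```

As a usage example, a loop body with four memory operations, one multiply, and one add on a datapath with two memory ports, one multiplier, and one adder has ResMII = 2.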




#### A Tight Schedule: (i,j) --> 2i+3j
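One way to see why a schedule such as 2i+3j can be tight is to check it by brute force. The sketch below assumes a particular setup that the slide's figure may or may not match: each processor owns a cluster of `cluster` consecutive values of i (written `c`) and all values of j, and iteration (c, j) starts at time `a*c + b*j`. The function counts how often two iterations of one processor would start in the same cycle; the name and interface are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Counts start-time conflicts on one processor that executes the
 * iterations (c, j), 0 <= c < cluster, 0 <= j < nj, at time a*c + b*j.
 * Assumes a >= 0 and b >= 0. Returns -1 if memory allocation fails. */
int count_conflicts(int a, int b, int cluster, int nj)
{
    assert(a >= 0 && b >= 0 && cluster >= 1 && nj >= 1);
    int span = a * (cluster - 1) + b * (nj - 1) + 1;  /* number of time slots */
    unsigned char *busy = calloc((size_t)span, 1);
    if (busy == NULL)
        return -1;

    int conflicts = 0;
    for (int c = 0; c < cluster; c++)
        for (int j = 0; j < nj; j++) {
            int t = a * c + b * j;
            if (busy[t])
                conflicts++;   /* another iteration already starts at t */
            else
                busy[t] = 1;
        }
    free(busy);
    return conflicts;
}
```

With a cluster of three, the schedule 2c+3j has no conflicts: the residues of 2c modulo 3 are 0, 2, 1, so every cycle is claimed by at most one iteration and the processor starts three iterations every three cycles. The schedule 3c+3j, by contrast, starts (1, 0) and (0, 1) in the same cycle.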



#### Tight Schedules – Prior Work

Darte/Delosme, Chen/Megson.

- *GIVEN*: Iteration space, projection direction, linear schedule
- *DETERMINE*: The allowed cluster shapes
- Tail Wags Dog!

#### **Constructing the Schedule**




#### **Processor Synthesis**



- Optimize the loop body
- Analyze bitwidth of all values
- Allocate the function units
- Map operations to function units
- Schedule operations
- Allocate registers and memory
- *Interconnect* communicating elements

#### Parallel, custom, designed to spec: EFFICIENT!

#### Bitwidth analysis - basic idea

Input information limits the amount of information that can be produced



Information required by consumers limits the amount that must be produced
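Both directions of the analysis can be sketched with simple width rules for unsigned values. The forward rules bound how many bits an operation can produce from the widths of its inputs; the backward rule caps that width at what the consumers actually read. The function names are hypothetical, and the rules below cover only addition and multiplication.

```c
#include <assert.h>

/* Forward: the sum of two unsigned values needs one bit more than
 * the wider operand. */
int add_width(int wa, int wb)
{
    assert(wa >= 1 && wb >= 1);
    return (wa > wb ? wa : wb) + 1;
}

/* Forward: the product of two unsigned values needs at most
 * wa + wb bits. */
int mul_width(int wa, int wb)
{
    assert(wa >= 1 && wb >= 1);
    return wa + wb;
}

/* Backward: if consumers only read the low `demanded` bits, the
 * producer need not compute more than that. This is valid for add and
 * multiply because their low result bits depend only on the low
 * operand bits. */
int required_width(int produced, int demanded)
{
    assert(produced >= 1 && demanded >= 1);
    return produced < demanded ? produced : demanded;
}
```

For example, multiplying a 12-bit input by an 8-bit coefficient can produce 20 bits, but if the result is stored to a 16-bit location only 16 bits have to be computed.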


#### **Optimal FU allocation**



# MILP: minimize cost subject to sufficient capacity
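One plausible way to write such a program is sketched below; it is a reading of "minimize cost subject to sufficient capacity", not necessarily the formulation PICO uses. All symbols are assumptions introduced here: $x_f$ is the number of function units of type $f$, $c_f$ its cost, $n_o$ the number of operations of type $o$ in the loop body, $F(o)$ the FU types able to execute $o$, and $y_{o,f}$ the number of operations of type $o$ assigned to FUs of type $f$. Each FU offers II issue slots per iteration.

$$
\begin{aligned}
\min_{x,\,y}\quad & \sum_{f} c_f\, x_f \\
\text{s.t.}\quad & \sum_{f \in F(o)} y_{o,f} = n_o && \text{for every operation type } o, \\
& \sum_{o} y_{o,f} \le \mathrm{II} \cdot x_f && \text{for every FU type } f, \\
& x_f \in \mathbb{Z}_{\ge 0}, \quad y_{o,f} \ge 0 .
\end{aligned}
$$

The first constraint says every operation is covered, and the second says no FU type is asked for more slots than it provides in II cycles.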


#### Allocation and Op Scheduling

Given: Inner loop and II

**Find**: Cheapest processor that achieves II on the loop




#### Conclusions

- Accurate static analysis of memory bandwidth: optimal tiling
- Linear iteration scheduling: a solved problem
- Efficient datapath synthesis: a hard problem, but good heuristics exist
- Automatic NPA synthesis is practical
- Automatic synthesis of full embedded systems is feasible, too

#### Related publications

Robert Schreiber, Shail Aditya, Scott Mahlke, Vinod Kathail, B. Ramakrishna Rau, Darren Cronquist, and Mukund Sivaraman. PICO-NPA: High-level synthesis of nonprogrammable hardware accelerators. Journal of VLSI Signal Processing, 31:127-142, 2002.

Shail Aditya, B. Ramakrishna Rau, and Vinod Kathail. Automatic architecture synthesis of VLIW and EPIC processors. In Proceedings of the 12th International Symposium on System Synthesis, San Jose, California, pp. 107-113, November 1999.

Alain Darte, Robert Schreiber, B. Ramakrishna Rau, and Frederic Vivien. Constructing and exploiting linear schedules with prescribed parallelism. ACM Transactions on Design Automation for Electronic Systems, 7(1), 2002.

Kyle Gallivan, William Jalby, and Dennis Gannon. On the problem of optimizing data transfers for complex memory systems. In Proceedings of the 1988 ACM International Conference on Supercomputing, pp. 238-253, 1988.

Scott Mahlke, Rajiv Ravindran, Michael Schlansker, Robert Schreiber, and Timothy Sherwood. Bitwidth cognizant architecture synthesis of custom hardware accelerators. IEEE Transactions on Computer-Aided Design of Circuits and Systems, 20(10):1-17, 2001.

William Pugh. The Omega test: a fast and practical integer programming algorithm for dependence analysis. Communications of the ACM, 35(8):102-114, 1992.

Patrice Quinton and Yves Robert. Systolic Algorithms and Architectures. Prentice Hall International (UK) Ltd., Hemel Hempstead, England, 1991.

B. Ramakrishna Rau. Iterative modulo scheduling. International Journal of Parallel Programming, 24:3-64, 1996.

B. Ramakrishna Rau, Vinod Kathail, and Shail Aditya. Machine-description driven compilers for EPIC and VLIW processors. Design Automation for Embedded Systems, 4:71-118, 1999.