# Classifying and Evaluating Performance-relevant Parameters for Reconfigurable Processors

<u>Lars Bauer</u>, Muhammad Shafique, and Jörg Henkel

**Chair for Embedded Systems (CES)** 

**University of Karlsruhe** 



# **Development of Embedded Systems**

- Typical:
  - Static analysis of hot spots
  - Building tightly optimized system
- Nowadays:
  - Increasing complexity
  - More functionality
- Problem:
  - Statically chosen design point has to match all requirements
  - Typically inefficient for individual components (e.g. tasks or hot spots)



# Possible Solution: Extensible Processors

Figure (not reproduced): efficiency versus flexibility of the design alternatives.

- Efficiency axis: MIPS/\$, MHz/mW, MIPS/area, ...
- Flexibility axis: 1/time-to-market, ...
- **ASIC** ("hardware solution"): non-programmable, highly specialized
- **Reconfigurable Computing:** processor with a reconfigurable ISA, i.e. reconfigurable **Special Instructions** (extensible processor)
- **General purpose processor** ("software solution")

#### Related Work: Reconfigurable Processors

- [CoMPARE'98]:
  - Fine-grained reconfigurable fabric coupled to the core pipeline
  - Can implement a single Special Instruction (SI) at a time
- [CHIMAERA'00]:
  - Supports multiple SIs in the reconfigurable fabric at the same time
- [MOLEN'04]:
  - Can be configured to support either a single SI or multiple SIs at the same time
- [RISPP'07]:
  - Supports multiple SIs at the same time and allows multiple (hardware) implementations per SI (providing different performance/area trade-offs)
  - Partitions SIs into Data Paths which are reconfigured independently
- Coarse-grained reconfiguration: ADRES, etc. (not the focus of this talk)

#### **Outline**

- Introduction
- Related Work
- Reconfigurable Processor Alternatives
  - Special Instruction-Based Categorization
  - Relevant Architectural Parameters
  - Design Space Exploration Tool
- Design Space Exploration
- Conclusion



# **SI-Based Categorization**

- Providing Special Instruction (SI) implementations:
  - How many SIs may be available at the same time?
  - How many implementation alternatives exist per SI?
  - Note: each SI may also be executed using the ISA of the core pipeline
- Technical constraint: a hardware description (e.g. an SI) is implemented as a rectangle
  - This is the typical shape for place & route tools
- These rectangular implementations cannot be placed at arbitrary positions within the reconfigurable fabric
  - They are typically aligned to dedicated communication ports that are provided at fixed positions



## **SI-Based Categorization Overview**



Figure (not reproduced): overview of the categories. Legend: Core Pipeline (scaled down), Reconfigurable area, Communication System, Special Instruction Container (SIC), Data Path Container (DPC).

# SI-Based Categorization: Category-1: Single SI Container



Corresponds to [CoMPARE'98]

- At most one SI is available in hardware at a given time
- Relatively long reconfiguration time, depending on the size of the SI Container
- Depending on the required amount of logic, two SIs might fit into the Container, but this is not supported
  - → Internal fragmentation

Legend: Core Pipeline (scaled down), Reconfigurable area, Special Instruction Container (SIC).

# SI-Based Categorization: Category-2: Multiple SI Containers



Corresponds to [CHIMAERA'00] and [MOLEN'04]

- An SI may be loaded into any free container
- An SI may not be larger than a single container, even if not all containers are in use
  - → External fragmentation (in addition to the internal fragmentation per SI Container)
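The two fragmentation effects can be illustrated with a minimal sketch. It assumes that container and SI sizes are measured in slices and that SIs are placed first come, first served; the function name `place_sis` and all sizes are hypothetical and not taken from the talk.

```python
def place_sis(num_containers, container_slices, si_sizes):
    """Category-2 style placement: at most one SI per container.

    Returns (placed, rejected, unused_slices). An SI larger than one
    container is rejected even if enough slices are free in total.
    """
    placed, rejected = [], []
    for name, size in si_sizes.items():
        if size <= container_slices and len(placed) < num_containers:
            placed.append(name)
        else:
            rejected.append(name)
    # Internal fragmentation: unused slices inside occupied containers
    unused = sum(container_slices - si_sizes[name] for name in placed)
    # Containers that stay empty contribute all of their slices
    unused += (num_containers - len(placed)) * container_slices
    return placed, rejected, unused


# Hypothetical SI sizes in slices (illustrative only)
sis = {"A": 100, "B": 60, "C": 260}
placed, rejected, unused = place_sis(3, 200, sis)
print(placed, rejected, unused)
```

In this made-up setting, SI `C` is rejected although more slices remain unused in total than it needs, which is the external fragmentation named above.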

Legend: Core Pipeline (scaled down), Reconfigurable area, Special Instruction Container (SIC).

# **Example: Modular SIs (using Data Paths)**



# **SI-Based Categorization:** Category-3: Multiple overlapping SIs



- There is no predetermined maximum number of supported SIs
- Multiple SIs may share common Data Paths (i.e. reuse them), because at most one SI is executed at a time
- This addresses the internal and external fragmentation problem
- Demands an internal communication system

Legend: Core Pipeline (scaled down), Reconfigurable area, Communication System, Special Instruction Container (SIC), Data Path Container (DPC).


#### **Example: Modular SIs** (allowing for multiple implementations)



- The 'Transform' Data Path might be available once (i.e. one instance has been reconfigured) and used 8 times to realize the SI functionality
- Or it might be available twice, with each instance used 4 times, or ...
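The trade-off between loaded instances and sequential reuse can be sketched as follows. This is a simplified model that assumes the 8 uses can be distributed evenly over identical instances and ignores communication overhead; the function name `passes_needed` is hypothetical, and the instance counts beyond two extrapolate the "or ..." on the slide.

```python
import math


def passes_needed(total_uses, loaded_instances):
    """Sequential passes when `total_uses` invocations of one Data Path
    are spread over `loaded_instances` identical instances."""
    if loaded_instances < 1:
        raise ValueError("at least one instance must be loaded")
    return math.ceil(total_uses / loaded_instances)


# The 'Transform' Data Path is used 8 times per SI execution
for instances in (1, 2, 4, 8):
    print(instances, passes_needed(8, instances))
```

Each additional instance costs reconfigurable area and reconfiguration time, which is why several implementations of the same SI with different performance/area trade-offs can coexist.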

# SI-Based Categorization: Category-4: Single Partitioned SI Container



- 1 SI Container, partitioned into n DP Containers that are connected by a communication system
- As more DPs finish reconfiguration, a faster implementation of an SI may become available
- Shorter reconfiguration time than Category-1 (only the demanded DPs need to be reconfigured)
- But still internal fragmentation (at most 1 SI supported, independent of its size)
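How a faster implementation becomes available as Data Paths finish reconfiguration can be sketched as a simple lookup. This is an illustrative model, not the selection logic of any of the cited architectures: the function `fastest_available`, the two SATD implementations, and all latencies are assumptions; only the Data Path names follow the Special Instruction Overview of this talk.

```python
from collections import Counter


def fastest_available(implementations, loaded_dps, software_latency):
    """Return (name, latency) of the fastest implementation whose Data Path
    demand is covered by the loaded Data Paths; fall back to software."""
    loaded = Counter(loaded_dps)
    best = ("software", software_latency)
    for name, (demand, latency) in implementations.items():
        covered = all(loaded[dp] >= n for dp, n in Counter(demand).items())
        if covered and latency < best[1]:
            best = (name, latency)
    return best


# Hypothetical implementations of the SATD SI; latencies (cycles) are invented
SATD_IMPLEMENTATIONS = {
    "minimal": (["QSub", "Transform", "Repack", "SAV"], 24),
    "fast": (["QSub", "Transform", "Transform", "Repack", "SAV"], 14),
}
SOFTWARE_LATENCY = 300  # invented latency when executed on the core pipeline

loaded = []
for dp in ["QSub", "Transform", "Repack", "SAV", "Transform"]:
    loaded.append(dp)  # one more Data Path finished reconfiguration
    print(len(loaded), fastest_available(SATD_IMPLEMENTATIONS, loaded, SOFTWARE_LATENCY))
```

Until all four distinct Data Paths are loaded, the sketch falls back to software execution; a second 'Transform' instance then enables the faster variant.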

Legend: Core Pipeline (scaled down), Reconfigurable area, Communication System, Special Instruction Container (SIC), Data Path Container (DPC).

#### **SI-Based Categorization:** Category-5: Multiple Partitioned SI Containers



- Shorter reconfiguration time
- Still internal fragmentation
- External fragmentation problems
- Additionally, if the DPs demanded by an SI are not located in the same SI Container, they cannot be used together (e.g. to implement an SI)



# SI-Based Categorization: Category-6: Multiple DP Containers



Corresponds to [RISPP'07]

#### Main differences to Category-3:

- SIs can be upgraded (due to multiple available SI implementations, as in Category-4 and 5)
- The decision of how many DP Containers are spent on which SI can adapt at run time
  - → Demands a run-time system

#### Main differences to Category-4 and 5:

- No external fragmentation
- Available DPs may be used for all SIs, i.e. they are not bound to a certain SI Container

Legend: Core Pipeline (scaled down), Reconfigurable area, Communication System, Special Instruction Container (SIC), Data Path Container (DPC).





#### Relevant Architectural Parameters

- Core Pipeline Frequency: f<sub>CPU</sub> [MHz]
- FPGA Frequency: f<sub>FPGA</sub> [MHz]
  - May differ from f<sub>CPU</sub> due to the fabrication technology
- Data Memory Connection
  - Number of Memory Ports: P
  - Bit width per Memory Port: W [Bits]
- Reconfiguration Bandwidth: R [MB/s]
  - Determines the time to reconfigure parts of the FPGA
  - Depends on the FPGA configuration port and the memory used
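The effect of R on the reconfiguration time can be estimated with a minimal sketch. It assumes that 1 MB equals 10^6 bytes and that the configuration port sustains the full bandwidth; the function names are hypothetical. The bitstream size and the values of R are those reported elsewhere in this talk.

```python
def reconfiguration_time_ms(bitstream_bytes, bandwidth_mb_per_s):
    """Time in milliseconds to load one partial bitstream (1 MB = 10**6 bytes)."""
    return bitstream_bytes / (bandwidth_mb_per_s * 1e6) * 1e3


def reconfiguration_cycles(bitstream_bytes, bandwidth_mb_per_s, f_cpu_mhz):
    """The same duration expressed in core pipeline cycles."""
    seconds = bitstream_bytes / (bandwidth_mb_per_s * 1e6)
    return seconds * f_cpu_mhz * 1e6


# Largest Data Path bitstream reported in the talk: 43,638 bytes
for r in (33, 66, 100):
    print(r, "MB/s:", round(reconfiguration_time_ms(43_638, r), 2), "ms")
```

Under these assumptions a single Data Path takes on the order of a millisecond or less to load, i.e. tens of thousands of core pipeline cycles at 100 MHz.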



#### **Design Space Exploration Tool: Overview of Input Data and Connections**



- SystemC-based simulator
- Input for the pipeline is obtained from an Instruction Set Simulator (ArchC)
- SI information is semi-automatically derived at compile time



# Design Space Exploration Tool: Internal Composition



# Design Space Exploration Tool: Detailed Run-time Analysis






## **Benchmark Application: H.264 Video Encoder**



- Challenging application with many computational hot spots
- Benchmarking 20 frames in QCIF resolution (176x144)
- The GPP (i.e. a SPARC-V8 without reconfigurable hardware) requires:
  - 10.6 seconds @ 100 MHz → 1.89 fps
  - 2.1 seconds @ 500 MHz → 9.52 fps
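The frame rates follow directly from the 20 benchmarked frames and the measured run times, as the short check below shows. The function name is hypothetical.

```python
def frames_per_second(frames, seconds):
    """Average frame rate over the whole benchmark run."""
    return frames / seconds


print(round(frames_per_second(20, 10.6), 2))  # 1.89 (GPP @ 100 MHz)
print(round(frames_per_second(20, 2.1), 2))   # 9.52 (GPP @ 500 MHz)
```

The ratio 10.6 s / 2.1 s is roughly 5, which matches the fivefold increase in clock frequency within the rounding of the reported times.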



# **Special Instruction Overview**

| Parameter             | Value               | Comment                                                                                                                          |
|-----------------------|---------------------|----------------------------------------------------------------------------------------------------------------------------------|
| # SIs                 | 9                   | 4/1 input/output registers (e.g. for memory addresses)                                                                           |
| # Data Paths          | 10                  | 2/2 input/output values, 32 bits each                                                                                            |
| SI composition        | 1–4 DPs             | Utilizing multiple instances per DP                                                                                              |
| SI memory accesses    | 0–128 words         | For some SIs the input from the register file is sufficient; others work on data memory (using up to 2 ports of 128 bits each)   |
| DP bitstream          | 42,719–43,638 bytes | Bitstream for partial reconfiguration on a Xilinx Virtex-II xc2v6000 FPGA                                                        |
| DP logic requirements | 16–192 slices       | Note: these values cover the pure computational logic, without the necessary interconnection overhead                           |
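To relate the SI memory accesses to the data memory connection, a lower bound on the transfer time can be sketched as follows. It assumes 32-bit words and that every port delivers its full width in each cycle; the constant `WORD_BITS` and the function name are hypothetical.

```python
import math

WORD_BITS = 32  # assumption: the "words" in the table are 32-bit words


def memory_transfer_cycles(words, ports, port_width_bits):
    """Lower bound on memory cycles needed to move `words` words when each
    of `ports` ports delivers `port_width_bits` bits per cycle."""
    if words == 0:
        return 0
    bits_per_cycle = ports * port_width_bits
    return math.ceil(words * WORD_BITS / bits_per_cycle)


# An SI accessing 128 words under different memory connections
for ports, width in ((1, 32), (1, 128), (2, 64), (2, 128)):
    print(f"P={ports}, W={width}:", memory_transfer_cycles(128, ports, width))
```

This bandwidth-only model rates 1 port of 128 bits and 2 ports of 64 bits as equal, so it does not capture the advantage of two ports that the evaluation in this talk reports for a given total bit width.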



# **Special Instruction Overview (cont'd)**

| Functional block     | Special Instruction | Implemented Data Paths            |
|----------------------|---------------------|-----------------------------------|
| Motion Estimation    | SAD                 | SAD_16                            |
| Motion Estimation    | SATD                | QSub, Transform/HT_4, Repack, SAV |
| (Inverse) Transform  | (I)DCT              | Transform/DCT_4, Repack, (QSub)   |
| (Inverse) Transform  | (I)HT_2x2           | Transform/HT_2                    |
| (Inverse) Transform  | (I)HT_4x4           | Transform/HT_4, Repack            |
| Motion Compensation  | MC_Hz_4             | PointFilter, Repack, Clip3        |
| Intra Prediction     | IPred_HDC           | CollapseAdd, Repack               |
| Intra Prediction     | IPred_VDC           | CollapseAdd                       |
| Loop Filter          | LF_BS4              | Cond, LF_4                        |

# **Evaluating Category-1 and 2**



Figure (not reproduced); x-axis: number of CLB columns (determining the amount of reconfigurable area).

# **Evaluating Category-4 and 5**



Summarizing 2,016 measurements

| Parameter          | Investigated values |
|--------------------|---------------------|
| Category           | 4, 5                |
| $f_{CPU}$ [MHz]    | 100                 |
| $f_{FPGA}$ [MHz]   | 50, 100             |
| R [MB/s]           | 33, 66, 100         |
| P                  | 1, 2                |
| W [Bits]           | 32, 64, 128         |
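The parameter combinations in the table can be enumerated with a short sketch. The dictionary keys are hypothetical names; the values are those listed in the table.

```python
from itertools import product

# Investigated values from the table above (key names are illustrative)
investigated = {
    "category": (4, 5),
    "f_cpu_mhz": (100,),
    "f_fpga_mhz": (50, 100),
    "reconf_bandwidth_mb_s": (33, 66, 100),
    "memory_ports": (1, 2),
    "port_width_bits": (32, 64, 128),
}

# One dictionary per point in the design space
configs = [dict(zip(investigated, values)) for values in product(*investigated.values())]
print(len(configs))  # 72
```

The table alone yields 72 combinations. The reported 2016 measurements equal 72 × 28, which would be consistent with sweeping one further dimension such as the amount of reconfigurable area, although the slides do not state this.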



Figure annotations (plot not reproduced):

- SI Container is not sufficient
- Rather many DPCs no longer lead to a slower execution time

# **Evaluating Category-3, 5, and 6**



'Critical' amount of DP Containers:

- Category-5: 8 DPCs
- Category-3: 5 DPCs
- Category-6: 5 DPCs





### **Evaluating impact of FPGA Frequency**

Observation:

- When insufficient FPGA resources are available (rather sequential computation), the CPU frequency has the higher impact.
- When sufficient FPGA resources are available (more parallel computation), the FPGA frequency has the higher impact.

Figure (not reproduced): panels for 50 MHz FPGA and 100 MHz FPGA; x-axis: number of Data Path Containers.

L. Bauer, Talk @ MPSoC'09, August 7th, http://ces.univ-karlsruhe.de/bauer

# **Evaluating Data Memory Connection**



- For a given total bit width, the 2-port data memory always outperforms the 1-port memory
- The data memory connection affects the potential parallelism
  - → It thereby affects the relevance of the CPU and/or FPGA frequency



# **Evaluating Reconfiguration Bandwidth**





## **Summary & Conclusion**

- Reconfigurable Processors are a promising approach for challenging and/or dynamically changing applications
- They can be categorized according to their implementation of Special Instructions
  - How many SIs may be available at the same time?
  - How many implementation alternatives exist per SI?
  - This covers existing architectures and unveils further ones
- Furthermore, different architectural parameters affect the performance of the system
- These parameters interact with each other (e.g. data memory connection and CPU/FPGA frequency)
  - This talk highlighted which parameters are relevant in which situation, based on an exhaustive design space exploration



# Thank you for your attention!


