# leti <u>ceatech</u>


# **ACCURACY-ENERGY TRADE-OFF WITH DYNAMIC ADEQUATE OPERATORS**

MPSoC 2017 | Anca Molnos | 06/07/2017




- Context: adequate/approximate computing
- Hardware
  - Design methodology for dynamic-accuracy operators
- Software use-case
  - Trading sharpness with energy consumption in an autofocus application
- Conclusions



# WHY APPROXIMATE COMPUTING



- Many emerging applications are error tolerant (or error resilient)
  - E.g., human perception, noisy input, recognition/classification algorithms
- Opportunity: increased performance or reduced energy consumption without significant impact on the quality of results
  - Probabilistic computing, imprecise computing, approximate computing, ...
  - Adequate computing = run-time accuracy-energy trade-off
    - Accommodate time-varying error tolerance
  - This idea may be applied at various levels: algorithm, data encoding, arithmetic precision, **operator implementation**, ...



# STATE-OF-THE-ART

# **1.** Approximate/inexact circuits:

- Mostly adders and multipliers
- Kyaw, Goh and Yeo, EDSSC'10; Huang, Lach and Robins, DAC'12; Farshchi, Saeed and Fakhraie, CADS'13; Jiang, Han and Lombardi, GLSVLSI'15; Bhardwaj, Mane and Henkel, Int'l Symp. on QED'15; Lingamneni et al., TECS'13; etc.

# **2.** Approximate synthesis:

- Generalization of the previous techniques to any netlist
- Shin and Gupta, ATS'08; Venkataramani et al., DAC'12; Miao, Gerstlauer and Orshansky, ICCAD'13; etc.

# **3.** Quality-configurable circuit architectures:

- Voltage over-scaling + error correction
- Adders, multipliers, etc.
- De la Guya Solaz, Han, Conway, IEEE Trans. on CAS'11; Kahng and Kang, DAC'12; Ye et al., ICCAD'13; Liu, Han and Lombardi, DATE'14; etc.
- Voltage scalable meta-functions
- Mohapatra, Chippa, Raghunathan and Roy, DATE'11

# **4.** Dynamic Voltage and Accuracy Scaling (DVAS):

- Use *technological knobs* only (no design modifications)
- Reduce the bit-width of the input operands (LSBs set to 0) and scale the voltage down
- Moons and Verhelst, ISLPED'15, etc.
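
The numerical side of the bit-width reduction above can be sketched in a few lines of Python. This is only an illustrative software model: the function names `truncate_lsbs` and `reduced_accuracy_add`, and the choice of unsigned 16-bit operands, are assumptions made here and do not come from the cited work. The energy saving itself comes from voltage down-scaling in hardware and is not modelled.

```python
def truncate_lsbs(value: int, active_bits: int, width: int = 16) -> int:
    """Keep the `active_bits` most significant bits of an unsigned
    `width`-bit operand and force the remaining LSBs to 0."""
    if not 0 < active_bits <= width:
        raise ValueError("active_bits must be in 1..width")
    dropped = width - active_bits
    # Mask with ones in the kept (upper) positions and zeros in the dropped LSBs.
    mask = ((1 << width) - 1) & ~((1 << dropped) - 1)
    return value & mask


def reduced_accuracy_add(a: int, b: int, active_bits: int, width: int = 16) -> int:
    """Add two operands after zeroing their LSBs (hypothetical model)."""
    return truncate_lsbs(a, active_bits, width) + truncate_lsbs(b, active_bits, width)
```

With `active_bits` equal to `width` the addition is exact; each dropped bit roughly doubles the worst-case error contributed by an operand.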

# **DYNAMIC ACCURACY OPERATORS**

#### • Limitation of DVAS:


- In a standard ASIC implementation flow, most timing paths have a delay close to the critical one
- Lower VDD → the entire operator is slowed down.



#### Need fine-grain power/delay tuning!

One possibility: multi-VDD

- Requires level shifters
- Excessive power overhead for an FU

#### Alternative:

- Combine DVAS with FDSOI's back bias scaling.
- Fine-grain threshold voltage (Vth) tuning



D. Pagliari et al. "A Methodology for the Design of Dynamic Accuracy Operators by Runtime Back Bias", DATE 2017



# CIRCUIT PARTITIONING IN $V_{\rm BB}$ DOMAINS

- **Proposed partitioning:** regular tiling
  - Start from placed design
  - Divide the area into N identical $V_{BB}$ domains
  - Assign each cell to the "closest" domain
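
A minimal sketch of the regular tiling step, assuming that "closest" means the tile containing the cell centre and that the N domains form a `rows` × `cols` grid over the placed area. The function name `assign_domains` and its parameters are hypothetical and are not taken from any EDA tool.

```python
from typing import Dict, Tuple


def assign_domains(cells: Dict[str, Tuple[float, float]],
                   width: float, height: float,
                   cols: int, rows: int) -> Dict[str, int]:
    """Map each placed cell (name -> (x, y) centre) to a domain index.

    Domains are identical rectangular tiles, numbered row by row.
    """
    tile_w = width / cols
    tile_h = height / rows
    domains: Dict[str, int] = {}
    for name, (x, y) in cells.items():
        # Clamp so that cells on or beyond the boundary fall into an edge tile.
        col = max(0, min(int(x // tile_w), cols - 1))
        row = max(0, min(int(y // tile_h), rows - 1))
        domains[name] = row * cols + col
    return domains
```

For example, a 10 × 10 area split into a 2 × 2 grid yields four domains of 5 × 5 each, and a cell centred at (9, 1) lands in domain 1.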



#### • Pros:

- Regularity of design
- Easy to incorporate in an EDA flow
- Minimal displacement of cells w.r.t. a placement with standard constraints
  - Minimal timing, area and power overheads at maximum bit-width.

#### • Cons:

- For a given accuracy, if a single domain contains cells that require different Vth values, all cells must receive the lowest Vth
- Leakage overhead!



# **IMPLEMENTATION FLOW – 2 PHASES**

#### Implementation Phase

- Partition circuit in VBB domains using regular tiling.
- Incremental placement:
  - Insert well-taps
  - Fix possible constraint violations in nominal operating conditions due to cell displacement.

#### Analysis & Optimization Phase

- Exhaustive exploration of all possible configurations of accuracy, $V_{BB}$, and possibly global $V_{DD}$
- Static Timing Analysis (STA) to prune infeasible configurations (timing violations)
- Power analysis on feasible configurations
- Many configurations (thousands), but fast analysis. Feasible for fewer than 10-15 $V_{BB}$ domains
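
The exploration loop can be sketched as follows. This is an illustration under stated assumptions, not the authors' flow: each domain is assumed to have only two bias states (forward bias on or off), and the callbacks `meets_timing` and `power` are hypothetical stand-ins for the STA and power-analysis tool runs.

```python
from itertools import product
from typing import Callable, Dict, Iterable, Tuple

# (accuracy in bits, global VDD, forward-bias flag per VBB domain)
Config = Tuple[int, float, Tuple[bool, ...]]


def count_configurations(n_accuracies: int, n_vdds: int, n_domains: int) -> int:
    """Size of the search space; it doubles with every added domain."""
    return n_accuracies * n_vdds * 2 ** n_domains


def explore(accuracies: Iterable[int], vdds: Iterable[float], n_domains: int,
            meets_timing: Callable[[Config], bool],
            power: Callable[[Config], float]) -> Dict[int, Config]:
    """Return, for each accuracy, the feasible configuration with the lowest power."""
    best: Dict[int, Config] = {}
    best_power: Dict[int, float] = {}
    bias_patterns = list(product((False, True), repeat=n_domains))
    for acc, vdd, fbb in product(accuracies, vdds, bias_patterns):
        cfg: Config = (acc, vdd, fbb)
        if not meets_timing(cfg):      # pruning step (STA in the real flow)
            continue
        p = power(cfg)                 # power analysis on survivors only
        if acc not in best_power or p < best_power[acc]:
            best_power[acc] = p
            best[acc] = cfg
    return best
```

Under the two-state assumption, the exponential term in `count_configurations` is what limits exhaustive analysis to a small number of domains.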



# EXPERIMENTAL RESULTS

#### • Implemented several operator circuits (16-bit fixed-point)

- Booth multiplier, FFT butterfly unit, 30-tap FIR filter
- Technology: 28nm UTBB FDSOI from STMicroelectronics

#### • Operating conditions:

- $V_{DD} \in \{0.6V, 0.7V, \dots, 1.0V\}$
- Forward BB: $V_{BB} = \pm 1.1V$ (N-Well/P-Well)
- Nominal condition (for 1st placement): 1.0V + FBB in all domains.






- Context: adequate/approximate computing
- Hardware
  - Design methodology for dynamic-accuracy operators
- Software use-case
  - Trading sharpness with energy consumption in an autofocus application
- Conclusions



#### • Auto-focus control loop on a region of interest (ROI)

- adaptive, integral controller
- sharpness → modified Haar wavelet and normalisation
  - only the horizontal and vertical high-pass components of the transform are taken into consideration
  - simple computation (many addition operations and few division operations)
- Zarudniev et al. NEWCAS'15
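
To give a feel for why such a metric is dominated by additions, here is a hedged sketch of a Haar-style sharpness measure. It is not the modified transform of Zarudniev et al.: the one-level 2×2 decomposition, the normalisation by the sum of pixel values, and the name `haar_sharpness` are all assumptions chosen for illustration.

```python
from typing import Sequence


def haar_sharpness(roi: Sequence[Sequence[int]]) -> float:
    """Sum of |horizontal| + |vertical| Haar details over 2x2 blocks,
    normalised by the total pixel intensity (illustrative definition)."""
    rows, cols = len(roi), len(roi[0])
    if rows % 2 or cols % 2:
        raise ValueError("ROI dimensions must be even")
    detail = 0
    total = 0
    for r in range(0, rows, 2):
        for c in range(0, cols, 2):
            a, b = roi[r][c], roi[r][c + 1]
            d, e = roi[r + 1][c], roi[r + 1][c + 1]
            horizontal = (a + d) - (b + e)   # left column minus right column
            vertical = (a + b) - (d + e)     # top row minus bottom row
            # The diagonal detail (a - b - d + e) is deliberately ignored.
            detail += abs(horizontal) + abs(vertical)
            total += a + b + d + e
    # A single division at the end; everything else is additions/subtractions.
    return detail / total if total else 0.0
```

A flat block scores 0, a vertical edge such as `[[10, 0], [10, 0]]` scores 1, and a purely diagonal pattern scores 0 because only the horizontal and vertical components are counted.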





- Baseline: precise additions, energy level = 1
- 39 images, ROI with 256×256 pixels $\rightarrow$ 17K addition operations
  - 16K in Haar transform
  - 1K in normalization
- 2 cases
  - all operations (Haar wavelet and normalization) are approximate
  - additions in the Haar wavelet are approximate and the normalization is precise



A 30% reduction in energy $\rightarrow$ a sharpness degradation of only 2%.

# **EXAMPLE OF SHARPNESS DEGRADATION**

#### Maximum sharpness




#### Relative sharpness error 25%



#### Relative sharpness error 10%



#### Relative sharpness error 50%





#### • Summary and perspectives:

- Back-Bias is an effective knob for fine-grain delay/power tuning in quality-configurable functional units.
  - First ever application of Back-Biasing to quality-configurable systems (to our knowledge).
  - Easy to integrate with EDA flows
  - Many avenues for optimization: alternative partitioning techniques (irregular tiling), avoiding exhaustive analysis of all $V_{DD}$, $V_{BB}$, and accuracy combinations, ...
  - Compare with emerging multiple-precision fixed and floating point approaches
  - Approximate/adequate memory and storage
- Energy gain potential for a small loss in final image sharpness in autofocus use-case.
  - More applications to be investigated

#### • Challenges at application level

- Analysis/programming/compiler support for adequate operations
  - Methods to find sensitive application parts
  - Analysis of error propagation for large applications
- Algorithmic changes for better error-resilience



#### ACKNOWLEDGEMENTS

- Yves Durand, CEA-Leti, France
- Edith Beigne, CEA-Leti, France
- Daniele Jahier Pagliari, Politecnico di Torino, Italy
- Massimo Poncino, Politecnico di Torino, Italy

# Thank you!

Leti, technology research institute | Commissariat à l'énergie atomique et aux énergies alternatives | Minatec Campus | 17 rue des Martyrs | 38054 Grenoble Cedex | France | www.leti.fr

