

# Processing energy vs. performance: Conclusions from a multi-core design

Gerd Ascheid, F. Borlenghi, M. Witte

Institute for Communication Technologies and Embedded Systems









# **Optimum Receiver Processing**



**Optimum processing is infeasible; practical approach:**

- Optimum estimation + likelihood information passed from left to right
- Potential performance gain by iterative optimization (like in Turbo and LDPC decoding)






# Case Study: A MIMO Doubly Iterative Receiver



### **MIMO** doubly iterative receiver

- Two iteration types
  - Outer iteration ($\mathcal{OI}$): MIMO detector $\Leftrightarrow$ channel decoder
  - Inner iteration ($\mathcal{II}$): inner iteration of channel decoder
- Their execution complexity and time relation: $C_{\mathcal{OI}} = a \times C_{\mathcal{II}}$, $T_{\mathcal{OI}} = b \times T_{\mathcal{II}}$, with $a > 1$, $b > 1$
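
The complexity relation above can be turned into a small cost model for comparing iteration schedules. This is a minimal sketch, not the authors' model: it assumes that each outer iteration costs one detector pass of $a \times C_{\mathcal{II}}$ plus a fixed number of inner decoder iterations, and the function names and the budget search are hypothetical.

```python
def schedule_cost(n_outer, n_inner, a, c_inner=1.0):
    """Cost of n_outer outer iterations, each followed by n_inner inner ones.

    Assumption (not from the slides): one outer iteration costs a * c_inner
    for the detector pass, plus n_inner * c_inner for the decoder.
    """
    if a <= 1:
        raise ValueError("the slides state a > 1")
    return n_outer * (a * c_inner + n_inner * c_inner)


def schedules_within_budget(budget, a, max_outer, max_inner, c_inner=1.0):
    """List every (n_outer, n_inner) schedule whose cost fits the budget."""
    feasible = []
    for n_outer in range(1, max_outer + 1):
        for n_inner in range(1, max_inner + 1):
            if schedule_cost(n_outer, n_inner, a, c_inner) <= budget:
                feasible.append((n_outer, n_inner))
    return feasible


# Example with an illustrative ratio a = 4 (hypothetical value).
feasible = schedules_within_budget(budget=10.0, a=4.0, max_outer=2, max_inner=6)
```

With these illustrative numbers, one outer iteration leaves room for up to six inner iterations, while two outer iterations leave room for only one inner iteration each. The same structure applies to the time relation with $b$ in place of $a$.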













## IteRX

- A 2.78 mm<sup>2</sup> 65 nm CMOS Gigabit MIMO Iterative Demapping and Decoding (IDD) Receiver
  - Lead designers: F. Borlenghi, E. M. Witte
  - in collaboration with A. Burg, EPFL

























#### ⇒ Alignment problem!

- Reorder and pack vectors? NO! 1<sup>st</sup> vector may finish last
- Serialize access (1 LLR/cycle)? NO! Limits max system throughput

Solution: a specialized alignment unit and a custom memory structure achieve a throughput of 1 vector/cycle.
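
To see why serialized access is costly, it helps to count the LLRs in one vector. The sketch below assumes one LLR per coded bit and as many spatial streams as antennas in an NxN configuration; both are assumptions for illustration, and the function names are hypothetical.

```python
def llrs_per_vector(n_streams, qam_order):
    """Number of LLRs in one detected vector: streams * bits per QAM symbol."""
    bits_per_symbol = qam_order.bit_length() - 1  # exact log2 for powers of two
    return n_streams * bits_per_symbol


def cycles_per_vector_serialized(n_streams, qam_order, llrs_per_cycle=1):
    """Memory cycles needed per vector when only llrs_per_cycle LLRs move per cycle."""
    total = llrs_per_vector(n_streams, qam_order)
    return -(-total // llrs_per_cycle)  # ceiling division


# LLR count per vector for antenna counts 2, 3, 4 and 4/16/64-QAM.
table = {
    (n, q): llrs_per_vector(n, q)
    for n in (2, 3, 4)
    for q in (4, 16, 64)
}
```

Under these assumptions a 4x4, 64-QAM vector carries 24 LLRs, so moving 1 LLR/cycle would take 24 cycles per vector, compared with the single cycle targeted by the alignment unit.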









### **Implementation Results**

- First silicon implementation of MIMO IDD baseband
- Core area: 2.78 mm<sup>2</sup> (1.58 MGE) in low-power 65 nm technology
  - Detector (5 SDs): 872 kGE (55%), 140 to 145 kGE / SD
  - Decoder: **447 kGE** (28%)
  - LLR memory: **210 kGE** (13%)
- Runtime flexibility
  - {2x2, 3x3, 4x4} antennas
  - {4, 16, 64} QAM
  - All 802.11n LDPC codes
- Max. frequencies @ 1.2 V
  - Detector: 135 MHz
  - Decoder: 299 MHz
- Avg. power (4x4, 64 QAM)
  - Detector: 175 to 245 mW
  - Decoder: 120 to 140 mW
- Max. throughput > 1 Gbit/s











# **Execution Time Dependencies**

### SISO MIMO demapper

- The sphere detector gets slower with decreasing SNR because fewer branches are pruned (i.e., more are evaluated)
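
The pruning effect can be illustrated with a toy depth-first sphere search that counts visited tree nodes. This is a simplified, hard-output, real-valued sketch and not the soft-input soft-output detector of the chip; the matrix `R`, the symbol set, and the received vectors are hypothetical values chosen for illustration.

```python
import math


def sphere_detect(R, y, symbols):
    """Depth-first search for argmin ||y - R s||^2, R upper triangular.

    Returns (best_vector, best_metric, visited_nodes). The sphere radius
    starts at infinity and shrinks whenever a better leaf is found.
    """
    n = len(y)
    state = {"best": None, "radius_sq": math.inf, "visited": 0}
    s = [0.0] * n

    def search(level, partial_metric):
        # Rows are processed from the last one (n - 1) up to row 0.
        for cand in symbols:
            s[level] = cand
            predicted = sum(R[level][j] * s[j] for j in range(level, n))
            metric = partial_metric + (y[level] - predicted) ** 2
            state["visited"] += 1
            if metric >= state["radius_sq"]:
                continue  # pruned: this branch lies outside the sphere
            if level == 0:
                state["radius_sq"] = metric  # shrink the sphere
                state["best"] = list(s)
            else:
                search(level - 1, metric)

    search(n - 1, 0.0)
    return state["best"], state["radius_sq"], state["visited"]


R = [[1.0, 0.5], [0.0, 1.0]]
symbols = [-1.0, 1.0]
# Transmitted s = [1, -1]; the first y is noise-free, the second heavily disturbed.
_, _, visited_clean = sphere_detect(R, [0.5, -1.0], symbols)
_, _, visited_noisy = sphere_detect(R, [0.1, 0.05], symbols)
```

In this example the noise-free vector is resolved after 4 node visits, while the disturbed vector needs 6, because the first leaf found no longer gives a radius tight enough to prune the other subtree.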

#### LDPC decoder

- LDPC decoder gets slower with decreasing SNR because more iterations are required for convergence
- There is a maximum number of reasonable iterations for low SNR (if convergence has not yet occurred, it will likely never be achieved)
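
The two points above amount to early termination with an iteration cap. The sketch below shows only that control loop: it stops as soon as all parity checks are satisfied and gives up after `max_iters`. The single-bit-flipping step is a toy stand-in for a real message-passing iteration, and the small parity-check matrix is a (7,4) Hamming code used for illustration, not an 802.11n LDPC code.

```python
def syndrome_is_zero(H, bits):
    """True if every parity check (row of H) is satisfied."""
    return all(sum(h * b for h, b in zip(row, bits)) % 2 == 0 for row in H)


def flip_worst_bit(H, bits):
    """Toy iteration: flip the bit involved in the most unsatisfied checks."""
    unsatisfied = [sum(h * b for h, b in zip(row, bits)) % 2 for row in H]
    counts = [
        sum(u * row[j] for u, row in zip(unsatisfied, H))
        for j in range(len(bits))
    ]
    worst = counts.index(max(counts))
    flipped = list(bits)
    flipped[worst] ^= 1
    return flipped


def decode_with_early_termination(H, bits, iterate, max_iters):
    """Iterate until the syndrome is zero or max_iters is reached.

    Returns (bits, iterations_used, converged).
    """
    for used in range(max_iters):
        if syndrome_is_zero(H, bits):
            return bits, used, True
        bits = iterate(H, bits)
    return bits, max_iters, syndrome_is_zero(H, bits)


H = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]
received = [0, 0, 0, 0, 0, 0, 1]  # all-zero codeword with one bit error
decoded, used, converged = decode_with_early_termination(
    H, received, flip_worst_bit, max_iters=5
)
```

The number of iterations actually used, and therefore the execution time, depends on the received word, while `max_iters` bounds the worst case.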

### IDD iterations

Number of necessary iterations increases with decreasing SNR



















What is the maximum throughput that can be achieved with given hardware resources (energy/bit or gate count)?
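
One way to make this question concrete is the energy per bit, i.e. average power divided by throughput. The sketch below is illustrative arithmetic only: the pairing of a power value with a throughput value is an assumption, since the slides do not state that the reported power and the maximum throughput belong to the same operating point.

```python
def energy_per_bit_pj(power_mw, throughput_gbps):
    """Energy per bit in pJ: 1 mW / (1 Gbit/s) equals 1 pJ/bit."""
    if throughput_gbps <= 0:
        raise ValueError("throughput must be positive")
    return power_mw / throughput_gbps


def max_throughput_gbps(power_mw, energy_budget_pj_per_bit):
    """Throughput that a given power sustains within an energy/bit budget."""
    return power_mw / energy_budget_pj_per_bit


# Hypothetical operating point: 295 mW total at 1 Gbit/s.
e_fast = energy_per_bit_pj(295.0, 1.0)
# Same power at half the throughput, e.g. when more iterations are needed.
e_slow = energy_per_bit_pj(295.0, 0.5)
```

At constant power, halving the throughput doubles the energy per bit, which is why SNR-dependent execution time translates directly into SNR-dependent energy cost.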












