



Breaking the Interleaving Bottleneck in Communication Applications for Efficient SoC Implementations

Norbert Wehn wehn@eit.uni-kl.de

MPSoC'05, 11-15 July 2005 Relais de Margaux Hotel, France







# **Generic Decoding Structure**



- 1. Subdecoder 1 processes one complete block
- 2. After finishing the calculation, data are sent to subdecoder 2
- 3. Subdecoder 2 starts block processing...
- 4. Iterate until the stopping criterion is fulfilled
- Information exchange takes place "randomly"
  - Tanner graph (Parity check matrix)/ Interleaver
  - Quality of "randomness" influences communications performance
- High Throughput/Low Latency architectures
  - Interconnect centric architectures
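
The block-wise schedule in steps 1 to 4 can be sketched as follows. This is a minimal illustration, not the decoder from the talk: `toy_subdecoder` is a hypothetical placeholder for a real soft-in/soft-out stage, and the stopping rule (`min_abs_llr`) is an assumed example criterion.

```python
def interleave(values, perm):
    """Read values in permuted order: out[i] = values[perm[i]]."""
    return [values[p] for p in perm]


def deinterleave(values, perm):
    """Inverse of interleave: out[perm[i]] = values[i]."""
    out = [0.0] * len(values)
    for i, p in enumerate(perm):
        out[p] = values[i]
    return out


def toy_subdecoder(channel_llr, a_priori):
    """Hypothetical stand-in for a real subdecoder (not a real SISO stage)."""
    return [0.5 * (c + a) for c, a in zip(channel_llr, a_priori)]


def decode(channel_llr, perm, max_iterations=8, min_abs_llr=None):
    """Alternate two subdecoders, exchanging data through the interleaver."""
    channel_llr_perm = interleave(channel_llr, perm)
    a_priori_1 = [0.0] * len(channel_llr)
    iterations = 0
    for iterations in range(1, max_iterations + 1):
        # Step 1: subdecoder 1 processes one complete block.
        ext_1 = toy_subdecoder(channel_llr, a_priori_1)
        # Step 2: results are sent to subdecoder 2 in interleaved order.
        a_priori_2 = interleave(ext_1, perm)
        # Step 3: subdecoder 2 processes the block.
        ext_2 = toy_subdecoder(channel_llr_perm, a_priori_2)
        a_priori_1 = deinterleave(ext_2, perm)
        # Step 4: assumed stopping criterion, all values reliable enough.
        if min_abs_llr is not None and min(abs(v) for v in a_priori_1) >= min_abs_llr:
            break
    return a_priori_1, iterations
```

Every value crosses the interleaver twice per iteration, which is why the interconnect dominates once the subdecoders are parallelized.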

# **Solving Interconnect Bottleneck**

Crossbar functionality with output blocking conflict (MPSoC'04)

#### Conflict free by code design

- Solves the interconnect problem at the "system level"
- Tanner graph/interleaver is designed according to a fixed architectural template with a regular interconnect topology, e.g. a barrel shifter
- Architecture imposes constraints on the code
  - Impact on communications performance
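
One way to picture "conflict free by code design" is an interleaver built so that, in every cycle, all parallel units access distinct memory banks and the unit-to-bank mapping is a cyclic shift. The construction below is a hypothetical sketch under those assumptions (data index `p * depth + t` is processed by unit `p` in cycle `t`); it is not the interleaver from the talk or from any standard.

```python
import random


def shift_based_interleaver(num_units, depth, seed=0):
    """Build a permutation whose per-cycle bank pattern is a cyclic shift."""
    rng = random.Random(seed)
    shifts = [rng.randrange(num_units) for _ in range(depth)]
    addresses = list(range(depth))
    rng.shuffle(addresses)
    perm = [0] * (num_units * depth)
    for p in range(num_units):
        for t in range(depth):
            bank = (p + shifts[t]) % num_units  # barrel-shifter setting for cycle t
            perm[p * depth + t] = bank * depth + addresses[t]
    return perm, shifts


def bank_conflicts(perm, num_units, depth):
    """Count cycles in which at least two units target the same bank."""
    conflicts = 0
    for t in range(depth):
        banks = [perm[p * depth + t] // depth for p in range(num_units)]
        if len(set(banks)) < num_units:
            conflicts += 1
    return conflicts
```

The price is visible in the construction: only permutations of this restricted shape are allowed, which is the constraint on the code noted above.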

#### Run-time conflict resolution (MPSoC'04)

- Largest flexibility, no impact on communications performance
- NoC approach
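
With an unconstrained interleaver, several units may target the same memory bank in one cycle, so the conflicts have to be absorbed at run time. The sketch below models this with one FIFO per bank that accepts all requests but serves one per cycle. It is a simplified assumption-laden model (random permutation, ideal routing, unbounded buffers), not the NoC presented at MPSoC'04.

```python
import random
from collections import deque


def simulate_random_interleaver(num_units, depth, seed=0):
    """Return (cycles needed, largest per-bank backlog) for a random interleaver."""
    rng = random.Random(seed)
    total = num_units * depth
    perm = list(range(total))
    rng.shuffle(perm)
    queues = [deque() for _ in range(num_units)]
    max_backlog = 0
    delivered = 0
    cycle = 0
    while delivered < total:
        if cycle < depth:
            # Each unit issues one write request per cycle.
            for p in range(num_units):
                target = perm[p * depth + cycle]
                queues[target // depth].append(target)
        max_backlog = max(max_backlog, max(len(q) for q in queues))
        # Each bank can serve only one request per cycle.
        for q in queues:
            if q:
                q.popleft()
                delivered += 1
        cycle += 1
    return cycle, max_backlog
```

For example, `simulate_random_interleaver(14, 64)` reports how many extra cycles and how much buffering a random pattern costs for 14 units; the exact figures depend on the seed.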

## **Algorithm Design Space**

- Turbo decoder, UMTS compliant, 166 MHz, 180 nm technology, throughput 100 Mbit/s
  - NoC approach, large flexibility
    - 14 parallel units, area = 16.84 mm<sup>2</sup> (14 mm<sup>2</sup> PUs, 2.8 mm<sup>2</sup> NoC)
  - Conflict-free interleaver design, limited flexibility
    - 10 parallel units, area = 11.23 mm<sup>2</sup> (10.3 mm<sup>2</sup> PUs, 0.9 mm<sup>2</sup> barrel shifter)
- Implementing LDPC Decoding on Network-On-Chip, T. Theocharides, G. Link, N. Vijaykrishnan, M. J. Irwin, Int. Conference on VLSI Design 2005
  - 1024 Bit block size, 1.2Gb/s (?), R=0.75
  - NoC: 5x5 2D mesh, dimension-order routing, large flexibility
  - 160nm CMOS Technology, 1.8V, synthesis, 500 MHz, 110 mm<sup>2</sup>, ~30 Watt
- A. Blanksby, C. Howland, IEEE Journal of Solid-State Circuits
  - 1024 Bit block size, 1 Gb/s, Rate=0.5
  - full hardwired interconnect network, no flexibility at all
  - 160nm CMOS Technology, 1.5 V, synthesis, 64 MHz, 52.5mm², ~700mW
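
The figures above can be turned into a few rough figures of merit. The snippet below only re-uses numbers from this slide; the helper names are illustrative, and since the designs differ in code, rate, technology, and clock, the results are indicative at best. The power values are the approximate ones given above.

```python
def interconnect_share(interconnect_mm2, total_mm2):
    """Fraction of the total area spent on the interconnect."""
    return interconnect_mm2 / total_mm2


def energy_per_bit_nj(power_w, throughput_mbit_s):
    """Energy per decoded bit in nJ (power divided by throughput)."""
    return power_w / (throughput_mbit_s * 1e6) * 1e9


# Turbo decoder: NoC versus barrel shifter.
noc_share = interconnect_share(2.8, 16.84)      # roughly 17 % of the area
shifter_share = interconnect_share(0.9, 11.23)  # roughly 8 % of the area

# LDPC decoders: NoC-based versus fully hardwired.
noc_ldpc_nj = energy_per_bit_nj(30.0, 1200.0)       # about 25 nJ/bit
hardwired_ldpc_nj = energy_per_bit_nj(0.7, 1000.0)  # about 0.7 nJ/bit
```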

## **DVB-S2 LDPC implementation**

- Blocksize 64800 Bit, R=1/4..9/10, 0.7 dB to Shannon limit
- Synthesis results for the DVB-S2 LDPC code decoder

| Block                               | Component           | Area [mm²] (0.13 µm, 270 MHz) |
|-------------------------------------|---------------------|-------------------------------|
| RAMs                                | Channel LLR         | 1.997                         |
| RAMs                                | Messages            | 9.117                         |
| RAMs                                | Addresses/Shuffling | 0.075                         |
| Logic                               | Functional Nodes    | 10.8                          |
| Logic                               | Control logic       |                               |
| Logic                               | Shuffling Network   | 0.55                          |
| Total Area [mm²]                    |                     | 22.74                         |
| Throughput [Mbit/s] @ 30 iterations |                     | 255                           |

### Why is this implementation so efficient?

- Code was defined with implementation complexity in mind
- IRA codes: subset of LDPC codes
- Limited flexibility (11 Tanner graphs)
- Permutation network
  - Shuffling network with some "tricky" memory allocation/assignments
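
Part of the implementation friendliness of IRA (irregular repeat-accumulate) codes comes from their accumulator structure: each parity bit is the previous parity bit XORed with a sparse sum of information bits. The sketch below illustrates that structure only. The connection table is a hypothetical toy example, not the DVB-S2 address tables.

```python
def ira_encode(info_bits, check_connections):
    """Compute parity bits with a running XOR (accumulator) over sparse sums."""
    parity = []
    previous = 0
    for connected in check_connections:
        bit = previous
        for i in connected:
            bit ^= info_bits[i]
        parity.append(bit)
        previous = bit
    return parity


def satisfies_checks(info_bits, parity, check_connections):
    """Verify every parity check: info sum + p[j] + p[j-1] = 0 (mod 2)."""
    previous = 0
    for j, connected in enumerate(check_connections):
        total = parity[j] ^ previous
        for i in connected:
            total ^= info_bits[i]
        if total != 0:
            return False
        previous = parity[j]
    return True


# Toy example: 4 information bits, 4 parity checks (hypothetical table).
toy_connections = [[0, 1], [1, 2], [2, 3], [0, 3]]
toy_parity = ira_encode([1, 0, 1, 1], toy_connections)
```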



### **Conclusion**

The Gap can be closed

- Application/Architecture Codesign
- Match the architecture & application
- "Just enough flexibility"
  - Application
  - Implementation

Thank you for listening!

For further information please visit [http://www.eit.uni-kl.de/wehn](http://www.eit.uni-kl.de/wehn)