

### **Processor Solutions for Energy-Efficient IoT Applications**

Pieter van der Wolf

MPSoC 2017 July 2 – 7, 2017




#### IoT applications

Sensor processing and voice / audio

Wireless connectivity

Security

Conclusion



### From the Edge to the Cloud



**IoT Edge Devices** (Smart Devices)



"Things" with sensors & actuators that monitor and control

Aggregation Layers (Hubs/Gateways)



Connectivity & Interfaces to aggregate the edge data to send to the cloud

#### Remote Processing (Cloud Based)



Applications to analyze the data and offer cloud services



### **IoT Edge Device Market**



#### Internet of Things and its Attractive Growth...

- 5B people connected by 2020<sup>1</sup>
- 11.5% CAGR through 2022 for IoT Chip Market (\$4.6B -> \$10.8B)<sup>2</sup>
- 50%+ Volume driven by Smart Home and Wearables<sup>3</sup>
- 55% Global IoT security market growth through 2019<sup>4</sup>
- \$7.4B and over 887 deals to IoT startups since 2010<sup>5</sup>

#### Fragmented Market..... Key Applications Drive Innovation

- Mobile handsets drive interoperability (WiFi, Cellular, Bluetooth)
- Regulations & Standards drive security
- Wearables drive energy efficiency
- Drive for low-cost in high-volume markets

Source<sup>1</sup>: World Economic Forum; Source<sup>2</sup>: MarketsandMarkets; Source<sup>3</sup>: Linley 2016 IoT Report; Source<sup>4</sup>: Marketresearchreports.biz; Source<sup>5</sup>: CB Insights



#### Billions of "Things" in use\*

#### 2021: 2.2 billion units

Figure 4-1. IoT unit share by market segment, 2021. We project the consumer-driven smart home and wearables segments to be the largest contributors. (Source: The Linley Group)

### **Example IoT SoC Architectures**



Corral the Market Fragmentation

#### **Smart Analog Device**



- Bare Metal
- 180nm, some 130/90nm



#### Low-End Edge Device



- RTOS: FreeRTOS, Zephyr, Rocket, Contiki....
- Integrated Radio
- 90nm → 55nm & 40nm (0.9 V)





- Linux, Android
- 65nm → 28nm-16nm



#### **SYNOPSYS**<sup>®</sup>

### **IoT Innovation Driving Design Requirements**



| Sensor Processing | Wireless Connectivity | Security | Energy Efficiency |
|-------------------|-----------------------|----------|-------------------|
| How much processing is needed to balance cost & performance? | What de-facto standards will emerge? | Pervasive security needed, but what exactly is required? | Add processing, connectivity & security while extending battery life |

### Successful strategy for IoT needs to address all of these requirements





### IoT applications

#### Sensor processing and voice / audio

Wireless connectivity

Security

Conclusion



### **Wearable Sensors**

Future Sensors Market Estimations

• "The market for wearable sensors will reach \$6.1bn by 2026"



SOURCE: The IDTechEx Research report - "Wearable Sensors 2016-2026: Market Forecasts, Technologies, Players"



For more information, see the IDTechEx Research report: Wearable Sensors 2015-2025: Market Forecasts, Technologies, Players (www.IDTechEx.com/WTSensors)





### Low-power DSP for IoT

Key processor features

- Energy efficiency
  - Low power consumption → low µW/MHz
  - High cycle efficiency → low MHz
- Key processor features
  - Processor architecture / ISA
    - For low power consumption and high cycle efficiency for targeted application
    - For small code size
  - Memory architecture
    - For reducing accesses to instruction and data memories
  - Configurability
    - Allow hardware features to be (de-)configured
    - For best balance of efficiency and area / power
  - Extensibility
    - Allow extension with customer-specific instructions
    - For high cycle efficiency







### XY for higher DSP performance



And lower I-memory power

#### C source code

```
q31_t foo(__xy q31_t *b, __xy q31_t *c) {
    q31_t s = 0;                  // accumulator
    for (int i = 0; i < N; i++)   // N: vector length, defined elsewhere
        s += b[i] * c[i];         // one MAC per element
    return s;
}
```





|                         | Non-XY | XY  |
|-------------------------|--------|-----|
| Performance [MAC/cycle] | 0.3    | 1.0 |
| I-power [B/MAC]         | 12     | 4   |



- Higher DSP performance
- Lower I-memory power
- Smaller code size

### XY and multi-issue architectures

Code size and I-memory power



#### C source code

```
q31_t foo(__xy q31_t *b, __xy q31_t *c) {
    q31_t s = 0;                  // accumulator
    for (int i = 0; i < N; i++)   // N: vector length, defined elsewhere
        s += b[i] * c[i];         // one MAC per element
    return s;
}
```

#### XY assembly

```
// prologue, set-up address gen
SR ...
SR ...
LP lpend
MAC 0, %agu_u0, %agu_u1
lpend: // no epilogue
```

- Lower I-memory power
- Smaller code size

|                         | Non-XY | XY  | Multi-issue     |
|-------------------------|--------|-----|-----------------|
| Performance [MAC/cycle] | 0.3    | 1.0 | 1.0             |
| I-power [B/MAC]         | 12     | 4   | 8               |

#### Multi-issue assembly

...

```
// Prologue SW pipelined
LDD %r2, [%r0, 8] ; 64b vector load
LDD %r4, [%r1, 8] ; 64b vector load
LDD %r6, [%r0, 8] ; 64b vector load
// 4x unrolled loop
LP lpend
{MAC 0, %r2, %r4; LDD %r8, [%r0, 8]} ; 32b MAC and 64b load
{MAC 0, %r3, %r5; LDD %r2, [%r1, 8]} ; 32b MAC and 64b load
{MAC 0, %r6, %r8; LDD %r4, [%r0, 8]} ; 32b MAC and 64b load
{MAC 0, %r7, %r9; LDD %r6, [%r1, 8]} ; 32b MAC and 64b load
```
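The I-power row can be read as instruction bytes fetched per MAC. The sketch below reproduces that arithmetic under assumptions that are not stated on the slide: every instruction is encoded in 4 bytes, the non-XY loop body needs three instructions per MAC (two loads plus the MAC), and each multi-issue bundle holds two instructions. `bytes_per_mac` and `INSN_BYTES` are hypothetical names used only for illustration.

```c
/* Assumption: fixed 4-byte instruction encoding. */
#define INSN_BYTES 4

/* Instruction bytes fetched per MAC for one pass through a loop body.
 * insns_in_body: instructions fetched per pass (assumed counts)
 * macs_in_body:  MAC operations performed per pass */
static double bytes_per_mac(int insns_in_body, int macs_in_body)
{
    return (double)(insns_in_body * INSN_BYTES) / (double)macs_in_body;
}
```

Under these assumptions, `bytes_per_mac(3, 1)` gives 12, `bytes_per_mac(1, 1)` gives 4, and `bytes_per_mac(8, 4)` gives 8, which matches the Non-XY, XY and Multi-issue columns of the table.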


### **Reducing memory accesses**

Wide memories for reducing I&D memory power

- Wide instruction memory
  - Fetch multiple instructions at once → no need to fetch an instruction every cycle
  - Widening I-memory by 2x reduces I-memory power by 30-40% for many applications
- Wide data memory
  - For example, combine 32-bit math with 64-bit LD/ST
  - Compiler performs LD/ST widening



- LD/ST widening
  - Loop unrolling
  - Vectorization of LD/ST
  - To use 64-bit LDD/STD
- Higher performance
  - Reduces cycle count
- Fewer memory accesses
  - Reduces memory power
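As a rough illustration of LD/ST widening, the sketch below contrasts a scalar loop with a hand-widened version in portable C. The function names are hypothetical, and `memcpy` of two adjacent elements only stands in for a single 64-bit `LDD`. On the actual toolchain the compiler is said to perform this transformation itself, so this is not how the source would be written in practice.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Baseline: one 32-bit load per element. */
static int64_t sum_scalar(const int32_t *a, size_t n)
{
    int64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Widened: loop unrolled by two, one 64-bit access fetches two elements. */
static int64_t sum_widened(const int32_t *a, size_t n)
{
    int64_t s = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        int32_t pair[2];
        memcpy(pair, &a[i], sizeof pair);  /* stands in for one 64-bit LDD */
        s += pair[0];
        s += pair[1];
    }
    if (i < n)                             /* odd count: one 32-bit load left */
        s += a[i];
    return s;
}
```

Both functions return the same sum. The widened version issues roughly half as many data-memory accesses, which is the effect the bullets above describe.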





### **Market Update**



Rapid Adoption of Natural Voice Based Human/Machine Interfaces (HMI)





© 2017 Synopsys, Inc.



### **Voice Activation**

Sensory TrulyHandsfree<sup>™</sup> on ARC EM5D

Example of always-on application: always listening



| ARC EM5D<br>TSMC 28nm HPM process                                                 | Frequency<br>Requirement | Power<br>Consumption |
|-----------------------------------------------------------------------------------|--------------------------|----------------------|
| <b>Detection Mode</b><br>Sensory LPSD function, 16 kHz microphone input           | 0.26 MHz                 | 0.9 µW*              |
| <b>Recognition Mode</b><br>100% (full cycle) recognition, 16 kHz microphone input | 7.6 MHz                  | 40 µW*               |

\* Logic dynamic power, gate-level simulation with post-layout RC
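The table's figures can be normalised to µW/MHz and combined into an average power for an always-listening device. The sketch below does that arithmetic. The frequency and power constants are taken from the table, while `recog_fraction` (the share of time spent in recognition mode) is a hypothetical workload parameter that does not appear on the slide.

```c
/* Figures from the table above (logic dynamic power, TSMC 28nm HPM). */
#define DETECT_MHZ 0.26
#define DETECT_UW  0.9
#define RECOG_MHZ  7.6
#define RECOG_UW   40.0

/* Dynamic power normalised to clock frequency, in uW/MHz. */
static double uw_per_mhz(double power_uw, double freq_mhz)
{
    return power_uw / freq_mhz;
}

/* Average power in uW when a fraction recog_fraction (0..1) of the time
 * is spent in recognition mode and the rest in detection mode.
 * recog_fraction is an assumed workload parameter. */
static double average_power_uw(double recog_fraction)
{
    return recog_fraction * RECOG_UW + (1.0 - recog_fraction) * DETECT_UW;
}
```

With the table's numbers this works out to about 3.5 µW/MHz in detection mode and about 5.3 µW/MHz in recognition mode. The average stays close to the detection figure as long as recognition is triggered rarely.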






### IoT applications

Sensor processing and voice / audio

Wireless connectivity

Security

Conclusion



### **Communications for the Internet of Things**



Many applications and protocols





### **Wireless Market Trends**

- 5G cellular (ITU IMT-2020)
  - Support >1 Gbps data rates for mobile devices
  - Support large numbers of connected devices
  - Standard to be finalized
  - Huge investment in know-how, hardware (processor + accelerators) and software
- Low Power Wide Area Networks
  - Sigfox: 1<sup>st</sup> mover, adopted by Samsung, Atmel (ATA8520)
  - LoRa Alliance: Semtech developed radio, supported by 11 cellular providers, Cisco, IBM, Microchip, FSL
  - Weightless: Huawei acquired Neul

### LTE-M and NB-IoT

- Low data rate (<1 Mbps) / low power / low cost cellular IoT
- Builds on installed cellular network infrastructure
- Supports long-range communication and mobility
- Automotive, utility meters, tele-health, tracking, vending machines, etc.








### **Communications for the Internet of Things**



Many applications with broad range of data rates and performance requirements





### LTE for IoT





| Feature            | Cat. 1 (Rel. 8+)                     | Cat. M1 (Rel. 13)                                 | Cat. NB1 (Rel. 13)                                | eNB-IOT (Rel. 14)                              |
|--------------------|--------------------------------------|---------------------------------------------------|---------------------------------------------------|------------------------------------------------|
| Bandwidth          | 20 MHz                               | 1.4 MHz                                           | 180 kHz                                           | 180 kHz                                        |
| Deployments        | LTE channel                          | Standalone, in LTE channel                        | Standalone, in LTE channel,<br>LTE guard bands    | Standalone, in LTE channel,<br>LTE guard bands |
| Full / half duplex | Full duplex (no HD-FDD)              | HD-FDD preferred                                  | HD-FDD                                            | HD-FDD preferred                               |
| Data rates (peak)  | DL: 10 Mbps,<br>UL: 5 Mbps           | ~800 kbps (FD-FDD)<br>300/375 kbps DL/UL (HD-FDD) | Less than 100 kbps                                | Higher than Cat. NB1                           |
| Latency            | < 1s                                 | ~ 5s                                              | <10s                                              | At least the same as Cat. NB1                  |
| Coverage           | Standard LTE coverage                | Improved coverage                                 | Deep coverage to work in cellars                  | Deep coverage to work in cellars               |
| Mobility           | Seamless                             | Seamless                                          | Connections get dropped on<br>base station switch | More mobility than Cat. NB1                    |
| Voice              | Yes                                  | Yes                                               | No                                                | No                                             |
| FEC                | Turbo (DL + UL)<br>Viterbi (DL + UL) | Turbo (DL + UL)<br>Viterbi (DL + UL)              | Viterbi (DL)<br>Turbo / repetition (UL)           |                                                |
| Encryption         | AES-128, SNOW 3G                     | AES-128, SNOW 3G                                  | AES-128, SNOW 3G                                  |                                                |
| Power saving       | DRX                                  | eDRX, PSM                                         | eDRX, PSM                                         | eDRX, PSM                                      |



### LTE power saving features



- Extended Discontinuous Reception (eDRX)
  - DRX sleep period of up to 10.24s
  - eDRX allows device to sleep for multiple periods of 10.24s
  - Cat.M1: up to ~40 minutes of extended sleep
  - Cat.NB1: up to ~4 hours of extended sleep

- Power Saving Mode (PSM)
  - User Equipment (UE) can decide to go dormant indefinitely
  - When UE decides to wake up it transmits to the network
  - UE remains in RX mode for 4 idle frames to receive a reply
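The eDRX figures above can be related to the 10.24 s DRX period with simple integer arithmetic. The sketch below counts how many whole periods fit into an extended sleep. It uses the slide's approximate limits rather than exact values from the 3GPP specification, and `edrx_whole_periods` is a hypothetical helper name.

```c
#include <stdint.h>

/* Longest basic DRX sleep period from the slide: 10.24 s, kept in
 * milliseconds so the arithmetic stays in integers. */
#define DRX_PERIOD_MS 10240u

/* Number of whole 10.24 s periods that fit into an extended sleep. */
static uint32_t edrx_whole_periods(uint32_t sleep_ms)
{
    return sleep_ms / DRX_PERIOD_MS;
}
```

For the slide's approximate limits this gives 234 periods for 40 minutes (Cat. M1) and 1406 periods for 4 hours (Cat. NB1).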



### **LTE-NB modem architecture**



#### RISC+DSP based SoC with HW accelerators

SoC architecture for wide band cellular modem



#### ARC EMxD with extensions

SoC architecture for LTE-NB software modem



| Hardware Acceleration                             | Pure Software                                                 |
|---------------------------------------------------|---------------------------------------------------------------|
| Longer development cycle -> slower time to market | Shorter development cycle -> faster time to market            |
| Limited flexibility for changing standards        | High flexibility for changing standards                       |
| Data transfer overhead, complex synchronization   | Simple software synchronization, optimal data exchange        |
| Limited multi standard support                    | Software standard implementation – can be switched on the fly |

### Viterbi processor extension

### Code example decoder

```
void viterbi_decode(__xy int32_vec3x8b sample[], __xy int32 result[], int frame) {
    // sample: array of 3*8b components of input sample
    // result: decoded result
    // frame:  number of bits in the frame (multiple of 32)
    int i, j, k;

    // store path metric decision bits
    __xy int32 decisions[frame*2];  // infer XY address generation

    // path metric reset, single cycle
    vitrst();

    // compute path metrics and decision bits, two cycles per bit
    for (i = 0; i < frame; i++) {
        decisions[2*i]   = vitacc0(sample[i]);
        decisions[2*i+1] = vitacc1(0);
    }

    // traceback, one cycle per bit
    i = 0; j = 0;
    while (i < frame) {
        for (k = 0; k < 31; k++) {
            vittb(decisions[2*i], decisions[2*i+1]);
            i++;
        }
        result[j++] = vittb(decisions[2*i], decisions[2*i+1]);
        i++;
    }
}
```



Instruction trace:

| Trace step | Address  | Encoding | Instruction | Operands             |
|------------|----------|----------|-------------|----------------------|
| 721925     | 000002a4 | 382f4840 | vitacc0     | agu_u0,agu_u1        |
| 721926     | 000002a8 | 386f4001 | vitacc1     | agu_u0,0             |
| 721927     | 000002a4 | 382f4840 | vitacc0     | agu_u0,agu_u1        |
| 721928     | 000002a8 | 386f4001 | vitacc1     | agu_u0,0             |
| ...        |          |          |             |                      |
| 928719     | 000002ec | 3900483e | vittb       | 0,agu_u1,agu_u0      |
| 928720     | 000002f0 | 3900483e | vittb       | 0,agu_u1,agu_u0      |
| 928721     | 000002f4 | 3900483e | vittb       | 0,agu_u1,agu_u0      |
| 928722     | 000002f8 | 3900483e | vittb       | 0,agu_u1,agu_u0      |
| 928723     | 000002fc | 3900483e | vittb       | 0,agu_u1,agu_u0      |
| 928724     | 00000300 | 3900483e | vittb       | 0,agu_u1,agu_u0      |
| 928725     | 00000304 | 3900483e | vittb       | 0,agu_u1,agu_u0      |
| 928726     | 00000308 | 39004822 | vittb       | agu_u2,agu_u1,agu_u0 |
| 928727     | 000002ec | 3900483e | vittb       | 0,agu_u1,agu_u0      |

### Viterbi processor extension

### MHz requirements for LTE-NB

| Viterbi decoding                      |    |
|---------------------------------------|----|
| Path metrics [cycles/bit]             | 2  |
| Traceback [cycles/bit]                | 1  |
| Total [cycles/bit]                    | 3  |
| Overhead for tailbiting [%]           | 50 |
| Overhead per frame [cycles/frame]     | 10 |
| Total including overhead [cycles/bit] | ~5 |

Key features of the Viterbi processor extension

- Supports constraint lengths K in range [5, 12]
- Supports base coding rates  $\frac{1}{2}$ ,  $\frac{1}{3}$ ,  $\frac{1}{4}$ ; other rates supported through (de)puncturing
- Supports tail-biting
- No need to have hardware accelerator on shared interconnect
  - With associated overheads of programming accelerators and moving data to/from accelerators
  - Instead have custom instructions as extensions to EMxD core



| Standard | Peak data rate | MHz requirements<br>ARC EMxD |
|----------|----------------|------------------------------|
| Cat.NB1  | 100 kbps       | 0.5 MHz                      |
| Cat.M1   | 800 kbps       | 4 MHz                        |

- Low MHz requirements
- Small code size
- Easy software integration
- No need to move data to/from accelerator
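The MHz figures follow from the cycles-per-bit budget and the peak data rate. The sketch below reproduces that arithmetic. How the two overheads combine is an assumption made here (tail-biting as a percentage on top of the 3 cycles/bit, plus the per-frame overhead spread over the frame's bits), and the function names and the 32-bit frame used in the usage note are hypothetical.

```c
/* Cycles per decoded bit: (path metrics + traceback) scaled by the
 * tail-biting overhead, plus the per-frame overhead spread over the
 * bits of one frame. The way the overheads combine is an assumption. */
static double cycles_per_bit(double path_metric_cpb, double traceback_cpb,
                             double tailbiting_pct,
                             double frame_overhead_cycles, int frame_bits)
{
    return (path_metric_cpb + traceback_cpb) * (1.0 + tailbiting_pct / 100.0)
         + frame_overhead_cycles / frame_bits;
}

/* Required clock in MHz for a peak data rate given in kbit/s. */
static double required_mhz(double rate_kbps, double cyc_per_bit)
{
    return rate_kbps * 1e3 * cyc_per_bit / 1e6;
}
```

With the table's values and an assumed 32-bit frame, `cycles_per_bit(2, 1, 50, 10, 32)` gives about 4.8, in line with the ~5 cycles/bit quoted. At 5 cycles/bit, `required_mhz(100, 5)` gives 0.5 MHz for Cat. NB1 and `required_mhz(800, 5)` gives 4 MHz for Cat. M1.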





### IoT applications

Sensor processing and voice / audio

Wireless connectivity

#### Security

Conclusion





### Attacks on the Rise & Evolving





### **Embedded Security Requirements for Processors**







### **Trusted Execution Environments**



Ensure separation of secure processes. Various implementations.




### **Cryptography Implementation Options**

- Crypto Software
- Specialized CPU Instructions
- HW Crypto Cores









### **CryptoPack for ARC EM and SEM Processors**



#### Hardware Extensions to Accelerate Cryptographic Algorithms

- Speeds up software implementations through tested and verified custom instructions
- Also reduces code size
- Area optimized and performance optimized versions
- Support for common crypto algorithms such as AES, 3DES, ECC, SHA-256, and RSA



CryptoPack SHA-256 increases ARC area by 8% but increases performance by 7x and reduces energy by 6.5x
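The two ratios also imply how average power changes, since energy is power times run time. A short worked check, assuming the 7x and 6.5x figures refer to the same SHA-256 workload (the symbols $P$, $E$ and $t$ are introduced here for illustration only):

$$
\frac{P_{\text{CryptoPack}}}{P_{\text{base}}}
= \frac{E_{\text{CryptoPack}} / t_{\text{CryptoPack}}}{E_{\text{base}} / t_{\text{base}}}
= \frac{1/6.5}{1/7}
= \frac{7}{6.5} \approx 1.08
$$

Under that assumption, average power rises by roughly 8% while the job finishes 7x sooner, which is how total energy still drops by 6.5x.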



#### IoT applications

Sensor processing and voice / audio

Wireless connectivity

#### Security

### Conclusion



### Conclusions



- Great variety of IoT applications
  - High growth in many markets
  - Combine functionalities for sensor processing, connectivity and security
  - Constraints on power and cost will require optimized implementations
- Significant energy savings can be achieved with processors optimized for IoT applications
  - Processors architected for energy efficiency and small code size
  - Memory architecture for reducing accesses to instruction and data memories
  - Configurability for best balance of efficiency and area / power
  - Extensibility for increased cycle efficiency and reduced code size





### Thank You

