

# **Efficient Implementations of Deep Neural Network Hardware**

#### **Takashi Miyamori**, General Manager, Center for Semiconductor Research & Development

#### **Toshiba Electronic Devices & Storage Corporation**

© 2017 Toshiba Corporation

### **Deep Learning Everywhere**



### **Semantic Segmentation**

Classify the object category of every pixel



\*) Badrinarayanan, V., Kendall, A., Cipolla, R.: SegNet: A deep convolutional encoder-decoder architecture for image segmentation. arXiv preprint arXiv:1511.00561 (2015)

### **Road Detection by DNN**





## Outline

- Technology Trends
- LOGNET: energy-efficient neural networks using logarithmic computation (Stanford Univ. & Toshiba) [ICASSP 2017]
- TDNN: Time-Domain Neural Network (Toshiba) [A-SSCC 2016]



# **Efficient DNN Implementations**

#### Improvement of Network Models

- GoogLeNet, ResNet
- Reduction of Parameters (# of data, bit width) and Compression
  - Deep Compression (Stanford): Pruning, Quantization, Huffman Coding
  - Binarized Neural Networks (Univ. of Montreal)



### **Improvement of Network Models**





| Model              | AlexNet         | VGG          | GoogLeNet | ResNet                     |
|--------------------|-----------------|--------------|-----------|----------------------------|
| Organization       | Univ. of Toronto | Oxford Univ. | Google    | Microsoft<br>Research Asia |
| Year               | 2012            | 2014         | 2014      | 2015                       |
| ILSVRC* Error Rate | 15.30%          | 7.33%        | 6.66%     | 3.57%                      |
| # of Layers        | 8               | 19           | 22        | 152                        |
| # of Parameters[M] | 62.4            | 144          | 7.0       | 56.0                       |
| # of Operations[B] | 1.14            | 19.6         | 1.5       | 11.3                       |

\*) ILSVRC: ImageNet Large Scale Visual Recognition Challenge



Based on slides of Dr. Momose, Hokkaido Univ. https://www.semiconportal.com/archive/contribution/applications/160804-neurochip2-2.html


### **Road Scene Semantic Segmentation**



\*) Training constrained deconvolutional networks for road scene semantic segmentation G Ros, S Stent, PF Alcantarilla, T Watanabe, arXiv preprint, arXiv:1604.01545 (2016)



## **Deep Compression (Stanford)**



Song Han, Huizi Mao, and William J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," arXiv preprint arXiv:1510.00149, 2015.

Song Han, Jeff Pool, John Tran, and William Dally, "Learning both weights and connections for efficient neural networks," in Proceedings of Advances in Neural Information Processing Systems 28 (NIPS2015), 2015, pp. 1135-1143.
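As a rough illustration of the pruning and quantization stages named in the slide title, the following minimal sketch zeroes small-magnitude weights and then maps the survivors onto a small shared codebook. The function names (`prune`, `quantize`), the threshold, and the codebook size are assumptions made for this example. The codebook here is simply evenly spaced, whereas the cited work trains it, and the Huffman coding stage is omitted.

```python
def prune(weights, threshold):
    """Magnitude pruning: zero every weight whose absolute value is below threshold."""
    return [w if abs(w) >= threshold else 0.0 for w in weights]


def quantize(weights, n_levels):
    """Map each nonzero weight to the nearest of n_levels evenly spaced shared values.

    Assumes n_levels >= 2. Pruned (zero) weights stay zero.
    Returns the quantized weights and the shared codebook.
    """
    nonzero = [w for w in weights if w != 0.0]
    if not nonzero:
        return list(weights), []
    lo, hi = min(nonzero), max(nonzero)
    codebook = [lo + (hi - lo) * k / (n_levels - 1) for k in range(n_levels)]
    quantized = [
        min(codebook, key=lambda c: abs(c - w)) if w != 0.0 else 0.0
        for w in weights
    ]
    return quantized, codebook


# Hypothetical example weights, not taken from any real network.
example = [0.9, -0.05, 0.4, 0.02, -0.8, 0.1]
pruned = prune(example, threshold=0.1)
shared, codebook = quantize(pruned, n_levels=4)
print(pruned)
print(shared)
```

After these two steps each surviving weight can be stored as a short index into the codebook instead of a full-precision value, which is where the storage saving comes from.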



### **Binarized Neural Networks (Univ. of Montreal)**

 Neural networks with binary weights and activations (+1/-1) except for the first and the last layers
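One reason +1/-1 weights and activations are attractive for hardware is that a dot product reduces to bitwise logic plus a bit count. The sketch below shows this under an assumed encoding (+1 stored as bit 1, -1 as bit 0); the helper names `pack` and `binary_dot` are hypothetical and not taken from the cited paper.

```python
def pack(values):
    """Pack a list of +1/-1 values into an int, one bit per value (+1 -> 1, -1 -> 0)."""
    bits = 0
    for i, v in enumerate(values):
        if v == 1:
            bits |= 1 << i
    return bits


def binary_dot(a_bits, b_bits, n):
    """Dot product of two packed +1/-1 vectors of length n using XOR and popcount.

    Positions where the bits match contribute +1 and mismatches contribute -1,
    so the result is n - 2 * (number of mismatching bits).
    """
    mask = (1 << n) - 1
    mismatches = bin((a_bits ^ b_bits) & mask).count("1")
    return n - 2 * mismatches


a = [1, -1, 1, 1]
b = [1, 1, -1, 1]
print(binary_dot(pack(a), pack(b), len(a)))
print(sum(x * y for x, y in zip(a, b)))  # reference result with ordinary multiplies
```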



I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, "Binarized neural networks," Advances in Neural Information Processing Systems 29, 2016, pp. 4107-4115.

## Outline

- Technology Trends
- LOGNET: energy-efficient neural networks using logarithmic computation (Stanford Univ. & Toshiba) [ICASSP 2017] [arXiv:1603.01025]
- TDNN: Time-Domain Neural Network (Toshiba) [A-SSCC 2016]



# **Motivation of LOGNET**

## To realize energy-efficient neural networks

- Data representation with fewer bits
- Eliminate multiplications
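The two goals above are linked: if a value is stored as an integer power-of-two exponent, multiplying by it becomes a bit shift. The following is a minimal sketch of that idea, assuming integer (fixed-point) weights and positive activations rounded to the nearest power of two in the log domain. The function names are hypothetical and the sketch is not claimed to match the LOGNET implementation.

```python
import math


def log2_quantize(x):
    """Integer exponent e such that 2**e approximates x (rounded in the log domain); x > 0."""
    return round(math.log2(x))


def shift(w, e):
    """Multiply integer w by 2**e using shifts only.

    Negative exponents use an arithmetic right shift, which rounds toward
    negative infinity.
    """
    return w << e if e >= 0 else w >> -e


def shift_dot(weights, exponents):
    """Dot product of integer weights with log-encoded activations, with no multiplies."""
    return sum(shift(w, e) for w, e in zip(weights, exponents))


weights = [3, -2, 5]            # hypothetical fixed-point weights
activations = [4.0, 1.0, 8.0]   # exact powers of two, so no rounding error here
exponents = [log2_quantize(a) for a in activations]
print(exponents)
print(shift_dot(weights, exponents))
```

Each exponent also needs far fewer bits than the value it represents, which is how the log representation serves the first goal as well.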



### **Evaluation of Proposed 1**



Top5 accuracy vs Full scale range: AlexNet

Top5 accuracy vs Full scale range: VGG16



# **Evaluation of Proposed 2**



 Top-5 accuracies after linear and log2 encoding of all layers' weights without retraining

| Model   | Float 32b | Lin. 4b | $\log_2$ 4b | Lin. 5b | $\log_2$ 5b |
|---------|-----------|---------|-------------|---------|-------------|
| AlexNet | 78.3%     | 1.6%    | 73.4%       | 71.0%   | 74.6%       |
| VGG16   | 89.8%     | 0.5%    | 85.2%       | 83.2%   | 86.0%       |
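One plausible reading of the 4b column is that a uniform grid with few levels rounds most small weights to zero, while a logarithmic grid keeps their order of magnitude. The sketch below contrasts the two encodings on a few hypothetical values. The quantizer definitions (step size `2**(fsr - bits)`, exponent range `[fsr - 2**bits, fsr]`), the parameter name `fsr` for the full scale range, and the handling of the sign outside the bit budget are simplifying assumptions for illustration, not a reproduction of the evaluated setup.

```python
import math


def linear_quantize(x, bits, fsr):
    """Uniform quantization of |x| with step 2**(fsr - bits), clipped to 2**fsr, sign kept."""
    step = 2.0 ** (fsr - bits)
    mag = min(round(abs(x) / step) * step, 2.0 ** fsr)
    return math.copysign(mag, x)


def log_quantize(x, bits, fsr):
    """Round log2|x| to an integer exponent clipped to [fsr - 2**bits, fsr], sign kept."""
    if x == 0:
        return 0.0
    e = max(fsr - 2 ** bits, min(fsr, round(math.log2(abs(x)))))
    return math.copysign(2.0 ** e, x)


# Hypothetical weights spanning several orders of magnitude.
for w in [0.5, 0.02, -0.003]:
    print(w, linear_quantize(w, bits=4, fsr=0), log_quantize(w, bits=4, fsr=0))
```

With these settings the two smaller weights collapse to zero under the linear grid but survive as nearby powers of two under the log grid.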



# **Training with Logarithmic Representation**

#### Training Algorithm



CIFAR10 database

VGG-like network



Enables end-to-end training using logarithmic representation at 5b level

## Outline

- Technology Trends
- LOGNET: energy-efficient neural networks using logarithmic computation (Stanford Univ. & Toshiba) [ICASSP 2017]
- TDNN: Time-Domain Neural Network (Toshiba) [A-SSCC 2016]



## Why is the brain so energy efficient?



Brain: the weight is built into each synapse, so it never needs to be moved. Power efficient!!

Conventional hardware: the weight is loaded from memory for EVERY calculation. Power hungry!!

| Operation            | Relative Cost (Energy) |
|----------------------|------------------------|
| 32 bit int ADD       | 1                      |
| 32 bit int MULT      | 31                     |
| 32 bit Register File | 10                     |
| 32 bit SRAM          | 50                     |
| 32 bit DRAM Memory   | 6400                   |
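To make the table concrete, the short calculation below adds up the relative energy of one multiply-accumulate with and without a weight fetch. It uses only the relative costs from the table; the one-fetch-per-operation model and the name `mac_energy` are assumptions for illustration.

```python
# Relative energy costs copied from the table above (32 bit int ADD = 1).
RELATIVE_ENERGY = {
    "add": 1,
    "mult": 31,
    "register_file": 10,
    "sram": 50,
    "dram": 6400,
}


def mac_energy(weight_source=None):
    """Relative energy of one multiply-accumulate, plus one 32-bit weight fetch if given."""
    fetch = RELATIVE_ENERGY[weight_source] if weight_source else 0
    return RELATIVE_ENERGY["mult"] + RELATIVE_ENERGY["add"] + fetch


local = mac_energy()            # weight already sits next to the arithmetic
from_sram = mac_energy("sram")
from_dram = mac_energy("dram")
print(local, from_sram, from_dram)
print(from_dram / local)        # how much a DRAM fetch inflates the cost of one MAC
```

Under this simple model the arithmetic itself is a small fraction of the total whenever the weight comes from DRAM, which is the argument for keeping each weight in place.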



### How about hardware efficiency?

e.g. The number of weights  $\rightarrow$  100x



This requires 100x as many processing elements, because each processing element (PE) is dedicated to a single weight. Each PE must be made as small as possible!!

### **Our strategy**

- In order to maximize the energy efficiency, we propose to employ fully spatially unrolled architecture (like the brain).
- In order to minimize the hardware size, we propose to employ Time Domain Analog and digital Mixed Signal processing (TDAMS) [11].





### **TDAMS - Convolution**

$\mathrm{SIGN}(W_1 X_1 + W_2 X_2)$

$(W_1 \times X_1' + W_2) \times X_2' = W_1 X_1' X_2' + W_2 X_2'$

$X_i' = \mathrm{XOR}(X_i, X_{i+1})$
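The nested form on this slide can be checked numerically. The sketch below assumes inputs in {+1, -1}, models XOR on that encoding as the product of the two values, and pads the input with a final +1 so that the last transformed input is defined. These are assumptions made for the check, as is the extension from two inputs to n inputs. Under them the chained accumulation reproduces the ordinary weighted sum, because each weight ends up multiplied by a telescoping product that collapses to its own input.

```python
from itertools import product


def direct(weights, xs):
    """Reference weighted sum W1*X1 + W2*X2 + ..."""
    return sum(w * x for w, x in zip(weights, xs))


def chained(weights, xs):
    """Nested form: acc = (acc + W_i) * X_i', with X_i' = X_i * X_{i+1} and a final +1 pad."""
    padded = list(xs) + [1]
    acc = 0
    for i, w in enumerate(weights):
        x_prime = padded[i] * padded[i + 1]  # XOR on the +1/-1 encoding, modelled as a product
        acc = (acc + w) * x_prime
    return acc


weights = [3, -2, 5]  # hypothetical weights
for xs in product([1, -1], repeat=len(weights)):
    assert chained(weights, xs) == direct(weights, xs)
print("nested form matches the direct weighted sum for all inputs")
```

The practical appeal of the nested form is that every stage does the same two things, add one weight and conditionally flip the sign, so identical stages can be cascaded. SIGN is then taken of the final accumulated value.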

## **TDAMS - ADD**



Leading Innovation >>>

### **TDAMS - Multiplication**





### **TDAMS - Activation (SIGN)**








# Chip photograph



- 65nm CMOS technology
- # of processing elements: 32768



### **Experimental results**






### **Performance comparison**

Energy efficiency is 10x better than ISSCC 2016 [3].

|                                         | Blueprint<br>w/ ReRAM | Test chip<br>w/ SRAM | GLS-VLSI<br>2015 [1] | Science<br>[2]                                 | ISSCC 2016<br>[3]    |
|-----------------------------------------|-----------------------|----------------------|----------------------|------------------------------------------------|----------------------|
| Tech. [nm]                              | 65                    | 65                   | 65                   | 28                                             | 40                   |
| Chip area [mm<sup>2</sup>]              | -                     | 3.61<sup>a</sup>     | 1.31<sup>a</sup>     | 430                                            | 0.012                |
| Energy efficiency [TSOp/s/W]            | **48.2**<sup>b</sup>  | **48.2**<sup>b</sup> | 0.402                | 0.039<sup>[6]</sup><br>0.4<sup>[7]</sup>       | **3.86**<sup>c</sup> |
| Hardware efficiency<sup>d</sup> [GE/PE] | 3                     | 76.5                 | 4641<sup>a</sup>     | 6.5                                            | 288                  |

<sup>a</sup> core area including SRAM, <sup>b</sup> excludes external I/O, <sup>c</sup> excludes CML, <sup>d</sup> 1 GE: 1.44 µm<sup>2</sup> (65nm), 0.65 µm<sup>2</sup> (40nm), 0.49 µm<sup>2</sup> (28nm)

[1] L. Cavigelli and L. Benini, "Origami: A 803 GOp/s/W convolutional network accelerator," arXiv preprint arXiv:1512.04295, 2015.

[2] P. A. Merolla, et al., "A million spiking-neuron integrated circuit with a scalable communication network and interface," Science, vol. 345, no. 6197, pp. 668-673, 2014.

[3] E. H. Lee and S. S. Wong, "A 2.5GHz 7.7TOPS/W switched-capacitor matrix multiplier with co-designed local memory in 40nm," in ISSCC Dig. Tech. Papers, pp. 418-419, 2016.
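The hardware-efficiency figure for the test chip can be cross-checked from numbers given elsewhere in the deck. The sketch below assumes that GE/PE is the core area divided by the number of PEs and by the area of one gate equivalent (GE) at that technology node. That formula and the name `gates_per_pe` are inferences for illustration, not something the slides state. The PE count of 32768 is taken from the chip photograph slide.

```python
# Area of one gate equivalent in um^2 per technology node, from the table footnote.
GE_AREA_UM2 = {65: 1.44, 40: 0.65, 28: 0.49}


def gates_per_pe(core_area_mm2, n_pe, tech_nm):
    """Gate equivalents per processing element, assuming area = n_pe * GE/PE * GE area."""
    area_um2 = core_area_mm2 * 1e6  # 1 mm^2 = 1e6 um^2
    return area_um2 / (n_pe * GE_AREA_UM2[tech_nm])


# Test chip w/ SRAM: 3.61 mm^2 core area, 32768 PEs, 65nm.
print(round(gates_per_pe(3.61, 32768, 65), 1))
```

Under this assumption the result agrees with the 76.5 GE/PE listed for the test chip.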



## **Blueprint with ReRAM**



PE = **3** 2-input NAND + memory cell (e.g. ReRAM)

$1.5\,\mu m^2$ per PE @28nm $\rightarrow$ 230M PEs / $4\,cm^2$ (cf. ResNet\*: 230M parameters)
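The per-PE area follows from the gate-equivalent size quoted in the performance comparison footnote (0.49 µm² per GE at 28nm). The short calculation below is a sketch that counts only the three NAND-equivalent gates and neglects the memory cell and wiring, which is an assumption:

$$3\,\mathrm{GE} \times 0.49\,\mu\mathrm{m}^2/\mathrm{GE} = 1.47\,\mu\mathrm{m}^2 \approx 1.5\,\mu\mathrm{m}^2$$

$$230 \times 10^{6}\,\mathrm{PEs} \times 1.5\,\mu\mathrm{m}^2 = 3.45 \times 10^{8}\,\mu\mathrm{m}^2 = 3.45\,\mathrm{cm}^2 < 4\,\mathrm{cm}^2$$

Under that assumption, 230M PEs fit within the stated $4\,\mathrm{cm}^2$ with some margin.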


\*) K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," arXiv preprint arXiv: 1512.03385, 2015.

1bit

### Summary

 Efficient Implementations of DNN are required for embedded systems and edge devices

#### Efficient Implementations

- Improvement of Network Models
  - Simple Network Models (e.g. GoogLeNet, ResNet)
- Reduction of Parameters (# of data, bit width) and Compression
  - Deep Compression (Stanford): Pruning, Quantization, Huffman Coding
  - Binarized Neural Networks

### Efficient Hardware Implementations

- LOGNET: energy-efficient neural networks using logarithmic computation
- TDNN (Time-Domain Neural Network)
  - Fully spatially unrolled architecture (like the brain).
  - Time Domain Analog and digital Mixed Signal processing (TDAMS)

# **TOSHIBA** Leading Innovation >>>