

- Provide fast access and high bandwidth; see our paper [ISCA 2006]
- Intel 80-core TeraFlop chip [ISSCC 2007]

[Figure: Intel 80-core chip with stacked SRAM. Legible labels: 80 cores, 256 KB SRAM per core, 4x C4 bump density, 3700 thru-silicon vias with Cu bumps. Source: Intel]

#### Does the Business Model Have an Impact on the Decision?





# 3D Stacked Microprocessor: Are We There Yet?

GABRIEL H. LOH Georgia Institute of Technology

YUAN XIE Pennsylvania State University

Three-dimensional integration has received considerable attention in the last several years from academic researchers and industry alike. This technology provides multiple layers of devices connected by a high-density, low-latency, layer-to-layer interface that can enable integrated circuits with more devices per unit area and allow the integration of different types of devices within the same 3D chip stack. Academic and ...

... technological leaders from a range of institutions, including major semiconductor companies, government agencies, and industry consortia. (Most respondents answered our questions on condition of anonymity, and some chose not to reply at all due to concerns over confidentiality and exposure of proprietary information.) Their responses provide a view of where 3D integration technology for microprocessors currently stands, ...

... Samsung, Tezzaron, and a few other companies have demonstrated, industry has reached the consensus that stacked memory will become mainstream. In this ... technology based on through-silicon-via (TSV) provides much faster and higher density inter-die connections than SiP or PoP.

The first question that many people are interested in is simply when TSV- ...

IEEE Micro 07/2010

# Examples

(1) Intel's Tick-Tock Business Model

(2) Memory vs Logic

[Figure: two stack configurations, each under a heat sink/fan, with the core and memory layers in opposite order]

#### Wide-IO DRAM Stacking

- Demonstrate the bandwidth benefits of 3D for a future Quad High-Definition TV (HDTV) application: first 3D IC prototype of an H.264 application
- Two logic layers (2.5x5mm²)
  - WTOP & WBOTTOM
  - Micro-bump connection
- Three DRAM layers (12.3x21.8mm²), 256MB
- Chartered 130nm + Tezzaron TSV fabrication



## A "More than Moore" Example



Xilinx® Stacked Silicon Interconnect

"Enable 100x improvement in Die-to-Die Bandwidth Per watt"

"Enable 2-3x Capacity Advantage Over Monolithic Devices"

- Four FPGA dies inside one package
- Record 2 million logic cells
  - One logic cell = one 4-input LUT + one D flip-flop (~500 transistors)
  - More than 1 billion transistors in a single package

## Our Prospect: uP+Memory in Package



Stacked Silicon with TSV-based 3D integration

- More and more transistors can be integrated into a single package
- About 100MB-1GB on-package DRAM would be available
- How to use these transistors efficiently?
  - Multi-core and many-core?
  - Larger cache size or deeper cache hierarchy?
  - On-package main memory?

### Logic + Memory Integration



- Option 1: Use these DRAM stacks as caches
  - The processor chip cannot hold the tag array
    - With 32 DRAM chips, the cache tags take up the equivalent of 2.1 DRAM chips
  - So each DRAM chip holds both data and tag arrays
    - In a 16-way associative cache, only 15 ways hold data; 1 way holds tags
  - Reading out all 15 data ways in parallel with the tag is not energy-efficient
    - The tag must be read first and then the matching way, which means 2x cache access latency: one access for the tag and one for the data
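The tag-overhead estimate can be checked with a back-of-the-envelope calculation. The sketch below is illustrative only: the 64-byte line size and 4 bytes of tag and state per line are assumptions, not figures from the slides, and `tag_overhead_chips` is a hypothetical helper name.

```python
def tag_overhead_chips(num_chips: int, line_bytes: int, tag_bytes: float) -> float:
    """Return how many DRAM chips' worth of capacity the tags would occupy.

    Each cache line of `line_bytes` needs `tag_bytes` of tag and state,
    so tags consume tag_bytes / line_bytes of the data capacity.
    """
    return num_chips * tag_bytes / line_bytes


# Assumed parameters (hypothetical): 64-byte lines, 4 bytes of tag + state.
chips = tag_overhead_chips(num_chips=32, line_bytes=64, tag_bytes=4)
print(f"Tag storage is roughly {chips:.1f} DRAM chips' worth of capacity")

# Dedicating 1 way of a 16-way set to tags gives the same 1/16 ratio.
tag_fraction = 1 / 16
```

With these assumed values the estimate lands at about 2 chips, in the same range as the 2.1 quoted on the slide; the exact figure depends on the address width and the number of state bits per line.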



#### How to use on-package DRAM efficiently?

- Option 2: Use on-package DRAM as part of the main memory



- Add on-package scheduling path
- Add a migration controller to move data into and out of the on-package DRAM


#### Which One is Better? Preliminary Analysis

#### Last-Level Cache

- Tag array overhead
- Cache access latency: 2x DRAM access latency
- Diminishing returns on miss rate
- Straightforward hardware control

#### Part of Main Memory

- Need memory controller support
- Static mapping
  - Results in non-optimal data partitioning
- Dynamic mapping
  - Need data migration

We first consider only heterogeneous main memory with static mapping.
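A static mapping can be pictured as a fixed split of the physical address space. The sketch below is a minimal illustration under stated assumptions: the `StaticMap` class, the low-addresses-first layout, and the 1 GiB on-package capacity are hypothetical choices, not the mapping used in the underlying work.

```python
from dataclasses import dataclass

GIB = 1 << 30


@dataclass(frozen=True)
class StaticMap:
    """Fixed address split: low addresses on-package, the rest off-package."""

    on_package_bytes: int  # assumed capacity of the on-package DRAM region

    def route(self, phys_addr: int) -> str:
        """Pick the memory region that serves a physical address."""
        if phys_addr < 0:
            raise ValueError("physical address must be non-negative")
        if phys_addr < self.on_package_bytes:
            return "on-package"
        return "off-package"


mapping = StaticMap(on_package_bytes=1 * GIB)
print(mapping.route(0x1000))       # falls in the on-package region
print(mapping.route(3 * GIB))      # falls in the off-package DIMMs
```

Because the split never changes, data that happens to live above the boundary stays in slow memory no matter how often it is accessed, which is the non-optimal partitioning noted above.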



# Dynamic Hetero Main Memory



- Static hetero memory works well when the application memory footprint is less than 1GB.
- If not, how to approach the performance of the ideal case?
- Solution: Dynamic data migration between on-package and off-package memory regions.
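One way to picture dynamic migration is an epoch-based policy that promotes the most frequently accessed pages. The sketch below is a toy model under stated assumptions: the `MigrationController` class, page-granularity tracking, and the epoch scheme are hypothetical and are not the migration algorithm from the underlying work.

```python
from collections import Counter


class MigrationController:
    """Toy model: keep the most-accessed pages in a small on-package region."""

    def __init__(self, on_package_pages: int):
        self.capacity = on_package_pages
        self.on_package = set()   # page numbers currently held on-package
        self.hits = Counter()     # per-page access counts in the current epoch

    def access(self, page: int) -> bool:
        """Record an access; return True if it was served on-package."""
        self.hits[page] += 1
        return page in self.on_package

    def migrate(self) -> None:
        """At an epoch boundary, move the hottest pages on-package."""
        hottest = [p for p, _ in self.hits.most_common(self.capacity)]
        self.on_package = set(hottest)
        self.hits.clear()


ctrl = MigrationController(on_package_pages=2)
trace = [7, 7, 7, 3, 3, 9]
first_pass = [ctrl.access(p) for p in trace]    # nothing is on-package yet
ctrl.migrate()                                  # pages 7 and 3 are hottest
second_pass = [ctrl.access(p) for p in trace]
print(sum(first_pass), "->", sum(second_pass), "on-package accesses")
```

A real controller would also have to account for the cost of copying pages and for accesses that arrive while a page is in flight; this sketch ignores both.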


#### **Data Migration Effectiveness**



- After data migration, the average memory access latency approaches the ideal case.
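The effect can be reasoned about with a simple weighted average of the two latencies. The sketch below assumes a two-level model in which the ideal case is every access served on-package; the latency values and access fractions are hypothetical placeholders, not measured results.

```python
def average_latency(on_fraction: float, t_on: float, t_off: float) -> float:
    """Weighted mean latency given the fraction of accesses served on-package."""
    if not 0.0 <= on_fraction <= 1.0:
        raise ValueError("on_fraction must be between 0 and 1")
    return on_fraction * t_on + (1.0 - on_fraction) * t_off


# Hypothetical latencies in nanoseconds, chosen only for illustration.
T_ON, T_OFF = 50.0, 100.0

before = average_latency(0.25, T_ON, T_OFF)  # few accesses hit on-package
after = average_latency(0.75, T_ON, T_OFF)   # migration raises the fraction
ideal = average_latency(1.0, T_ON, T_OFF)    # everything fits on-package
print(before, after, ideal)
```

As migration raises the on-package fraction, the average moves toward `t_on`, which is the ideal case in this model.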

#### Conclusion

- A silicon interposer provides a convenient way to integrate 3D DRAM with the processor.
- Using on-package DRAM as a last-level cache is not efficient in terms of performance.
- Heterogeneous main memory (on-package DRAM and off-package DIMM) is promising.
- Dynamic migration with on-chip memory controller support is important to the effectiveness of heterogeneous main memory.

For more details, please see our Supercomputing 2010 paper at http://www.cse.psu.edu/~yuanxie/3d.html