Dinesh Pamunuwa\*, Matt Grange\*\*, Axel Jantsch+, Sunil Rana\*, Tyson Tian Qin\* \*University of Bristol, Bristol, UK \*\*Mentor Graphics, USA +KTH Royal Institute of Technology, Stockholm, Sweden

# System Performance Analysis for Heterogeneous 3-D ICs and Emerging Technologies



## **Future Direction of 3-D ICs**

3-D ICs promise performance / cost advantages for high performance digital applications as well as enhanced functionality

□ A staggering number of degrees of freedom exist

- Packages (2-D, Multi-chip Modules, 3-D die stacks)
- Digital architecture (single to multi to many-core)
- Routing architectures (buses, Networks-on-Chip, hybrids)
- System organization (Memory hierarchy)
- Technology (CMOS, MEMS/NEMS, feature size reduction)

□ Framework for comparison



# **Intrinsic Computational Efficiency**

- The Intrinsic Computational Efficiency (ICE), proposed by T. Claasen, sets an upper bound on the computational capability of a silicon-based processor.
  - The entire silicon area of a processor is assumed to be filled with the most fundamental computational unit; in this case we have used 32-bit adders.
  - A real system could never achieve the same performance per Watt because this metric ignores the overhead of control circuitry, interconnect, and memory.
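As a rough illustration of why the ICE is an upper bound: if only the adder's own energy is counted, operations per second per Watt reduce to the reciprocal of the energy per addition. The sketch below assumes exactly that simplification. The function name and the 1 pJ example value are hypothetical and are not taken from the ICE data.

```python
def intrinsic_efficiency_mops_per_watt(energy_per_add_joules: float) -> float:
    """Upper-bound efficiency when only the adder's energy is counted.

    ops/s per Watt equals ops per Joule, i.e. 1 / (energy per operation).
    """
    if energy_per_add_joules <= 0:
        raise ValueError("energy per add must be positive")
    ops_per_joule = 1.0 / energy_per_add_joules
    return ops_per_joule / 1e6  # convert ops/J to MOPS/W


# Hypothetical value for illustration only: 1 pJ per 32-bit addition.
print(intrinsic_efficiency_mops_per_watt(1e-12))
```

Any energy spent on control, interconnect, or memory adds to the denominator, which is why a real system falls below this figure.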





# Expanding the ICE to the ECE

- □ The *ICE* gives an *upper bound* on efficiency, but cannot account for realistic systems because it only considers the computational unit.
- □ We build upon the *ICE* by modelling the three fundamental operations of any processing unit:



DP, D43D Jun 2013

### **Temporal and Spatial Organization of Memory**

□ μ<sub>S</sub>

 Gives us the amount of memory per operator. Think of it as the amount of on-chip cache available



□ μ<sub>T</sub>

Gives us the number of memory reads/writes per operation.

2 reads, 1 write  
$$\mu_T = 3$$






## **On- or Off-chip Memory?**

- □ ω (0-1)
  - Gives the ratio of on- to off-chip memory in the system; ω = 1 means all memory is on-chip.
    Off-chip memory requires exiting the die with I/O drivers to external chips.






## **Memory Distribution Factor**

- Δ (0-1)
  - Gives the distribution factor of the memory, or how close (local) it is to the operator.






## **Effective Computational Efficiency**

For each computation we must consider the energy cost of the operation itself, of the interconnect (on- and off-chip), and of the memory reads/writes.
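One way to picture this accounting is to add the operator energy to the energy of the memory accesses it triggers, splitting those accesses between on-chip and off-chip according to ω. The sketch below is only an assumed simplification, not the model used in this work: the class, the function, the weighting by ω, and all numeric values are hypothetical, and the effect of Δ is folded into the per-access interconnect energy rather than modelled explicitly.

```python
from dataclasses import dataclass


@dataclass
class AccessEnergy:
    """Energy of one memory access, split into wire and storage parts (J)."""
    interconnect_j: float
    memory_j: float

    @property
    def total_j(self) -> float:
        return self.interconnect_j + self.memory_j


def effective_efficiency_gops_per_watt(
    e_op_j: float,          # energy of the operation itself (J)
    mu_t: float,            # memory reads/writes per operation
    omega: float,           # assumed here: fraction of accesses served on-chip
    on_chip: AccessEnergy,
    off_chip: AccessEnergy,
) -> float:
    """Assumed model: E_total = E_op + mu_T * (w * E_on + (1 - w) * E_off)."""
    if not 0.0 <= omega <= 1.0:
        raise ValueError("omega must lie in [0, 1]")
    e_access = omega * on_chip.total_j + (1.0 - omega) * off_chip.total_j
    e_total = e_op_j + mu_t * e_access
    return 1.0 / e_total / 1e9  # ops per Joule -> GOPS/W


# Hypothetical energies, for illustration only.
on = AccessEnergy(interconnect_j=1e-12, memory_j=2e-12)
off = AccessEnergy(interconnect_j=20e-12, memory_j=10e-12)
print(effective_efficiency_gops_per_watt(1e-12, 3, 1.0, on, off))
print(effective_efficiency_gops_per_watt(1e-12, 3, 0.5, on, off))
```

With these made-up numbers the all-on-chip case is markedly more efficient than the half-off-chip case, which is the qualitative behaviour the parameters are meant to capture.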



## **Modelling Hierarchy**



### **Model Development**



### **Model Verification**




## **Stand-alone Parasitic Estimation Tool**








## **Modelling Hierarchy**




### **Thermal Behaviour**

Compact thermal models have been developed as part of the toolset to quickly predict the thermal behaviour.

- Verification based on comparison with results from a Computational Fluid Dynamics (CFD) solver (FloTHERM).
- The thermal behaviour and limitations of 2-D and 3-D packages have been extracted from simulations.
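To give a feel for what a compact thermal model can look like, the sketch below uses a generic one-dimensional series-resistance network for a die stack. It is not the model developed in this toolset: the function name, the single heat path to ambient, and every numeric value are assumptions chosen for illustration.

```python
def stack_temperatures(ambient_c, powers_w, resistances_k_per_w):
    """Steady-state layer temperatures for a 1-D stack (assumed model).

    Layers are ordered from the heat sink outward. resistances_k_per_w[i]
    is the thermal resistance between layer i and the layer below it
    (or ambient, for i = 0). All heat is assumed to leave via the sink,
    so the heat crossing resistance i is the power of layers i and above.
    """
    if len(powers_w) != len(resistances_k_per_w):
        raise ValueError("need one resistance per layer")
    temps = []
    t_below = ambient_c
    for i, r in enumerate(resistances_k_per_w):
        heat_through = sum(powers_w[i:])   # W crossing this resistance
        t_below = t_below + r * heat_through
        temps.append(t_below)
    return temps


# Hypothetical two-layer stack: 10 W per layer, 25 C ambient.
print(stack_temperatures(25.0, [10.0, 10.0], [0.5, 0.2]))
```

Even this crude network shows the layer furthest from the sink running hottest, which is the basic constraint on stack height.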



### **Stand-alone 3-D Thermal Tool**




## **Modelling Hierarchy**




### Scaling 2-D versus 3-D DRAM

#### 2-D with off-chip DRAM

#### 2-layer 3-D with in-stack DRAM






# **Fixed System: Intel 80 Core**


| Param. | Type                | Intel 80 Core       | Equivalent    |
|--------|---------------------|---------------------|---------------|
| N      | # of Layers         | 1 (2-D)             | 1-16 (3-D)    |
| A      | Die Area            | 12.64 mm × 21.72 mm | 275 mm²/N     |
| tn     | Tech. node          | 65 nm               | 180-17 nm     |
| b      | Data width          | 32-bit              | 32            |
| μs     | Memory/Operator     | 2K SRAM/2 FPU       | 1KB/Op        |
| μt     | Memory/Operation    | App. Specific       | 1-3           |
| σ      | Bus Sharing Ratio   | 8x10 mesh/160 FPU   | 18/160=0.11   |
| Δ      | Memory Distribution | NoC Mesh            | 0.01-0.1      |
| ω      | On/off-chip mem     | All on-chip         | 1             |
| P      | Power (W)           | 20-230              | App. Specific |

Implemented processors can be modelled by varying the parameters.
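The derived entries in the table can be checked with a few lines of arithmetic. The sketch below only reproduces numbers that appear in the table; the helper names are hypothetical, and the table does not say what the numerator 18 in the bus sharing ratio counts, so it is treated as an opaque value.

```python
# Values copied from the parameter table.
DIE_WIDTH_MM = 12.64
DIE_HEIGHT_MM = 21.72
SIGMA_NUMERATOR = 18   # meaning not stated in the table
OPERATORS = 160        # FPUs in the 8x10 mesh


def die_area_mm2(width_mm: float, height_mm: float) -> float:
    """Planar die area of the 2-D implementation."""
    return width_mm * height_mm


def footprint_per_layer_mm2(total_area_mm2: float, layers: int) -> float:
    """Per-layer footprint when the same area is split over N layers."""
    if layers < 1:
        raise ValueError("layers must be >= 1")
    return total_area_mm2 / layers


def bus_sharing_ratio(numerator: int, operators: int) -> float:
    """Sigma as written in the table: numerator / number of operators."""
    return numerator / operators


area = die_area_mm2(DIE_WIDTH_MM, DIE_HEIGHT_MM)
print(f"2-D die area: {area:.2f} mm^2")
for n in (1, 2, 4, 8, 16):
    print(n, footprint_per_layer_mm2(275.0, n))
print(f"sigma: {bus_sharing_ratio(SIGMA_NUMERATOR, OPERATORS):.2f}")
```

The 2-D die dimensions multiply out to just under the 275 mm² used for the equivalent model, so the per-layer footprint is that figure divided by the number of layers.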

The 4-layer 80-core 3-D system at 90 nm is still better than a 2-D system at 65 nm

For every doubling of the stack height the computational efficiency increases by 20-30%
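The per-doubling figure can be turned into a cumulative estimate for a given stack height. The sketch below assumes the gain compounds uniformly with each doubling, which is an interpretation of the statement above rather than a result from the model; the function name is hypothetical.

```python
import math


def stack_gain(layers: int, gain_per_doubling: float) -> float:
    """Efficiency multiplier relative to one layer (assumed compounding).

    gain_per_doubling is a fraction, e.g. 0.2 for a 20% increase.
    """
    if layers < 1:
        raise ValueError("layers must be >= 1")
    doublings = math.log2(layers)
    return (1.0 + gain_per_doubling) ** doublings


for n in (2, 4, 8, 16):
    low = stack_gain(n, 0.20)
    high = stack_gain(n, 0.30)
    print(f"{n:2d} layers: x{low:.2f} to x{high:.2f}")
```

Under this assumption a 4-layer stack gains between 1.44x and 1.69x over a single layer, and a 16-layer stack between roughly 2.1x and 2.9x.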

[Figure: GOPS/Watt versus Technology Node (nm), 180 nm to 60 nm. Annotations: "Intel 80 Core @ 65 nm"; "A 4-layer partitioned Intel 80 Core @ 90 nm achieves similar GOPS/Watt as 2-D implementation in 65 nm".]

## **Modelling Hierarchy**




## **NEM Relay Based Computation**

### Limitation of CMOS Energy Efficiency



### □ NEM relay advantages

- Practically zero leakage
- Very steep slope for turn-on/-off transient
- High on-current




## **NEM Relay Technology**



In-plane switch fabricated using standard lithography in the NEMIAC project; nm-scale gap formed using a sacrificial layer. Courtesy of D. Grogg et al., IBM Research Zurich<sup>1</sup>



### **NEM Relay Modelling**

### □ FEA to Reduced Order Model to Circuit Model





# **ICE with NEM Logic**

#### NEM "tech node" is much larger than CMOS

- Devices fabricated at 17 µm and 5 µm cantilever length
- Miniaturisation increases speed and reduces energy
  - No straightforward analogue to CMOS scaling

### Trajectory is promising

 Ultra-low power technology for low latency, low throughput applications

[Figure: Intrinsic Computational Efficiency (ICE) in MOPS/Watt versus Technology Node (nm), logarithmic axes. Series: ICE (T. Claasen), ICE (this work), NEM 32-bit (approx.), Tilera 100, Tilera 36.]

Ripple-carry architecture and worst-case energy

- Device models at 17 µm and 5 µm are silicon-qualified
- 4- and 8-bit adder energy at 17 µm based on an accurate circuit model
- 32-bit adders based on scaling
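The scaling rule used to reach 32 bits is not stated here. Since a ripple-carry adder is a chain of identical full-adder stages, one plausible reading is a linear extrapolation in bit width from the 4- and 8-bit results. The sketch below assumes that linear rule; the function name and the energy values (in arbitrary units) are hypothetical.

```python
def extrapolate_adder_energy(e4: float, e8: float, bits: int) -> float:
    """Linearly extrapolate worst-case adder energy to a wider adder.

    e4 and e8 are the energies of the 4- and 8-bit adders. The straight
    line through those two points is an assumption, not the authors' rule.
    """
    if bits < 1:
        raise ValueError("bits must be >= 1")
    slope_per_bit = (e8 - e4) / (8 - 4)
    return e4 + slope_per_bit * (bits - 4)


# Hypothetical energies in arbitrary units, for illustration only.
print(extrapolate_adder_energy(e4=4.0, e8=8.0, bits=32))
```

A fixed per-adder overhead shows up as a non-zero intercept, which is why two measured points are used instead of simply multiplying the 8-bit figure by four.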




### **NEMIAC Project**

- EU FP7 STREP: Grant No. 288670





### **ELITE Project**

- EU FP7 STREP: Grant No. 215030





