## Tutorial: Networks on Chip

The 1st ACM/IEEE International Symposium on Networks-on-Chip
Princeton, New Jersey, May 6, 2007

- Network Layer Communication Performance in Network-on-Chips A. Jantsch (Royal Institute of Technology, Sweden) 10:00-11:45 am
- Power, Energy and Reliability Issues in NoC R. Marculescu (Carnegie Mellon University, USA) 1:15-3:00 pm
- Tooling, OS Services and Middleware
L. Benini (University of Bologna, Italy)

3:15-5:00 pm

# Network Layer Communication Performance in Network-on-Chips 

Introduction<br>Communication Performance<br>Organizational Structure<br>Interconnection Topologies<br>Trade-offs in Network Topology<br>Routing<br>Quality of Service

Introduction


- Topology: How switches and nodes are connected
- Routing algorithm: determines the route from source to destination
- Switching strategy: how a message traverses the route
- Flow control: Schedules the traversal of the message over time


## Basic Definitions

## Basic Definitions

Message is the basic communication entity.
Flit is the basic flow control unit. A message consists of 1 or many flits.
Phit is the basic unit of the physical layer.

## Basic Definitions

Message is the basic communication entity.
Flit is the basic flow control unit. A message consists of 1 or many flits.
Phit is the basic unit of the physical layer.
Direct network is a network where each switch connects to a node.
Indirect network is a network with switches not connected to any node.

## Basic Definitions

Message is the basic communication entity.
Flit is the basic flow control unit. A message consists of 1 or many flits.
Phit is the basic unit of the physical layer.
Direct network is a network where each switch connects to a node.
Indirect network is a network with switches not connected to any node.
Hop is the basic communication action from node to switch or from switch to switch.

## Basic Definitions

Message is the basic communication entity.
Flit is the basic flow control unit. A message consists of 1 or many flits.
Phit is the basic unit of the physical layer.
Direct network is a network where each switch connects to a node.
Indirect network is a network with switches not connected to any node.
Hop is the basic communication action from node to switch or from switch to switch.
Diameter is the length of the maximum shortest path between any two nodes measured in hops.
Routing distance between two nodes is the number of hops on a route.
Average distance is the average of the routing distance over all pairs of nodes.

## Basic Switching Techniques

Circuit Switching A real or virtual circuit establishes a direct connection between source and destination.

## Basic Switching Techniques

Circuit Switching A real or virtual circuit establishes a direct connection between source and destination.

Packet Switching Each packet of a message is routed independently. The destination address has to be provided with each packet.

## Basic Switching Techniques

Circuit Switching A real or virtual circuit establishes a direct connection between source and destination.

Packet Switching Each packet of a message is routed independently. The destination address has to be provided with each packet.

Store and Forward Packet Switching The entire packet is stored and then forwarded at each switch.

## Basic Switching Techniques

Circuit Switching A real or virtual circuit establishes a direct connection between source and destination.

Packet Switching Each packet of a message is routed independently. The destination address has to be provided with each packet.

Store and Forward Packet Switching The entire packet is stored and then forwarded at each switch.

Cut Through Packet Switching The flits of a packet are pipelined through the network. The packet is not completely buffered in each switch.

## Basic Switching Techniques

Circuit Switching A real or virtual circuit establishes a direct connection between source and destination.

Packet Switching Each packet of a message is routed independently. The destination address has to be provided with each packet.
Store and Forward Packet Switching The entire packet is stored and then forwarded at each switch.

Cut Through Packet Switching The flits of a packet are pipelined through the network. The packet is not completely buffered in each switch.
Virtual Cut Through Packet Switching The entire packet is stored in a switch only when the header flit is blocked due to congestion.

## Basic Switching Techniques

Circuit Switching A real or virtual circuit establishes a direct connection between source and destination.

Packet Switching Each packet of a message is routed independently. The destination address has to be provided with each packet.
Store and Forward Packet Switching The entire packet is stored and then forwarded at each switch.

Cut Through Packet Switching The flits of a packet are pipelined through the network. The packet is not completely buffered in each switch.

Virtual Cut Through Packet Switching The entire packet is stored in a switch only when the header flit is blocked due to congestion.

Wormhole Switching is cut through switching and all flits are blocked on the spot when the header flit is blocked.

## Latency



Time $(n)=$ Admission + RoutingDelay + ContentionDelay
Admission is the time it takes to emit the message into the network.
RoutingDelay is the delay for the route.
ContentionDelay is the delay of a message due to contention.

## Routing Delay

## Store and Forward:

$$
T_{s f}(n, h)=h\left(\frac{n}{b}+\Delta\right)
$$

$n$... message size in bits
$n_{p} \ldots$ size of message fragments in bits
$h$... number of hops
b ... raw bandwidth of the channel
$\Delta$... switching delay per hop

## Routing Delay

Store and Forward:

$$
T_{s f}(n, h)=h\left(\frac{n}{b}+\Delta\right)
$$

## Circuit Switching:

$$
T_{c s}(n, h)=\frac{n}{b}+h \Delta
$$

$n$... message size in bits
$n_{p} \ldots$ size of message fragments in bits
$h \ldots$ number of hops
b ... raw bandwidth of the channel
$\Delta$... switching delay per hop

## Routing Delay

Store and Forward:

$$
T_{s f}(n, h)=h\left(\frac{n}{b}+\Delta\right)
$$

Circuit Switching:

$$
T_{c s}(n, h)=\frac{n}{b}+h \Delta
$$

Cut Through:

$$
T_{c t}(n, h)=\frac{n}{b}+h \Delta
$$

$n$... message size in bits
$n_{p} \ldots$ size of message fragments in bits
$h$... number of hops
b ... raw bandwidth of the channel
$\Delta$... switching delay per hop

## Routing Delay

## Store and Forward:

$$
T_{s f}(n, h)=h\left(\frac{n}{b}+\Delta\right)
$$

$$
T_{c s}(n, h)=\frac{n}{b}+h \Delta
$$

## Cut Through:

$$
T_{c t}(n, h)=\frac{n}{b}+h \Delta
$$

Store and Forward with

$$
T_{s f}\left(n, h, n_{p}\right)=\frac{n-n_{p}}{b}+h\left(\frac{n_{p}}{b}+\Delta\right)
$$ fragmented packets:

$n$... message size in bits
$n_{p} \ldots$ size of message fragments in bits
$h \ldots$ number of hops
b ... raw bandwidth of the channel
$\Delta$... switching delay per hop

## Routing Delay: Store and Forward vs Cut Through




## Routing Delay: Store and Forward vs Cut Through






## Local and Global Bandwidth

$$
\begin{aligned}
\text { Local bandwidth } & =b\left(\frac{n}{n+n_{E}+w \Delta}\right) \\
\text { Total bandwidth } & =C b[\text { bits } / \text { second }]=C w[\text { bits } / \text { cycle }]=C[\text { phits } / \text { cycle }]
\end{aligned}
$$ Bisection bandwidth ... minimum bandwidth to cut the net into two equal parts.

b ... raw bandwidth of a link;
$n$... message size;
$n_{E} \ldots$ size of message envelope;
$w$... link bandwidth per cycle;
$\Delta$... switching time for each switch in cycles;
$w \Delta$... bandwidth lost during switching;
$C$... total number of channels;

For a $k \times k$ mesh with bidirectional channels:

$$
\begin{aligned}
\text { Total bandwidth } & =\left(4 k^{2}-4 k\right) b \\
\text { Bisection bandwidth } & =2 k b
\end{aligned}
$$



## Link and Network Utilization

total load on the network: $L=\frac{N h l}{M}$ [phits/cycle]

$$
\text { load per channel: } \rho=\frac{N h l}{M C}[\text { phits/cycle }] \leq 1
$$

$M$... each host issues a packet every $M$ cycles
$C$... number of channels
$N$... number of nodes
$h$... average routing distance
$l=n / w \ldots$ number of cycles a message occupies a channel
$n$... average message size
w ... bitwidth per channel

## Network Saturation



Offered bandwidth


Delivered bandwidth

Typical saturation points are between $40 \%$ and $70 \%$.
The saturation point depends on

- Traffic pattern
- Stochastic variations in traffic
- Routing algorithm


## Organizational Structure

- Link
- Switch
- Network Interface


## Link

Short link At any time there is only one data word on the link.

Long link Several data words can travel on the link simultaneously.
Narrow link Data and control information is multiplexed on the same wires.

Wide link Data and control information is transmitted in parallel and simultaneously.

Synchronous clocking Both source and destination operate on the same clock.

Asynchronous clocking The clock is encoded in the transmitted data to allow the receiver to sample at the right time instance.

## Switch



## Switch Design Issues

Degree: number of inputs and outputs;

## Buffering

- Input buffers
- Output buffers
- Shared buffers

Routing

- Source routing
- Deterministic routing
- Adaptive routing

Output scheduling
Deadlock handling

## Control flow

## Network Interface

- Admission protocol
- Reception obligations
- Buffering
- Assembling and disassembling of messages
- Routing
- Higher level services and protocols


## Interconnection Topologies

- Fully connected networks
- Linear arrays and rings
- Multidimensional meshes and tori
- Trees
- Butterflies


## Fully Connected Networks



Bus: | switch degree | $=N$ |
| ---: | :--- |
| diameter | $=1$ |
| distance | $=1$ |
| network cost | $=O(N)$ |
| total bandwidth | $=b$ |
| bisection |  |
| bandwidth | $=b$ |

Crossbar: switch degree $=N$
diameter $=1$
distance $=1$
network cost $=O\left(N^{2}\right)$
total bandwidth $=N b$
bisection $=N b$
bandwidth

## Linear Arrays and Rings



## Multidimensional Meshes and Tori


$k$-ary $d$-cubes are $d$-dimensional tori with unidirectional links and $k$ nodes in each dimension:

| number of nodes $N$ | $=k^{d}$ |
| :--- | :--- |
| switch degree | $=d$ |
| diameter | $=d(k-1)$ |
| distance | $\sim d \frac{1}{2}(k-1)$ |
| network cost | $=O(N)$ |
| total bandwidth | $=2 N b$ |
| bisection bandwidth | $=2 k^{(d-1)} b$ |

## Routing Distance in $k$-ary $n$-Cubes



## Projecting High Dimensional Cubes



2-ary 2-cube


2-ary 3-cube


2-ary 4-cube


2-ary 5-cube

## Binary Trees



## $k$-ary Trees



## Binary Tree Projection



- Efficient and regular 2-layout;
- Longest wires in resource width:

$$
l W=2^{\left\lfloor\frac{d-1}{2}\right\rfloor-1}
$$

| $d$ | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
| :---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| $N$ | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
| $l W$ | 0 | 1 | 1 | 2 | 2 | 4 | 4 | 8 | 8 |

## $k$-ary $n$-Cubes versus $k$-ary Trees

$k$-ary $n$-cubes:

| number of nodes $N$ | $=k^{d}$ |
| :--- | :--- |
| switch degree | $=d+2$ |
| diameter | $=d(k-1)$ |
| distance | $\sim d \frac{1}{2}(k-1)$ |
| network cost | $=O(N)$ |
| total bandwidth | $=2 N b$ |
| bisection bandwidth | $=2 k^{(d-1)} b$ |

$k$-ary trees:

$$
\begin{array}{ll}
\text { number of nodes } N & =k^{d} \\
\text { number of switches } & \sim k^{d} \\
\text { switch degree } & =k+1 \\
\text { diameter } & =2 d \\
\text { distance } & \sim d+2 \\
\text { network cost } & =O(N) \\
\text { total bandwidth } & =2 \cdot 2(N-1) b \\
\text { bisection bandwidth } & =k b
\end{array}
$$

## Butterflies



## Butterfly Characteristics



$$
\begin{array}{ll}
\text { number of nodes } N & =2^{d} \\
\text { number of switches } & =2^{d-1} d \\
\text { switch degree } & =2 \\
\text { diameter } & =d+1 \\
\text { distance } & =d+1 \\
\text { network cost } & =O(N d) \\
\text { total bandwidth } & =2^{d} d b \\
\text { bisection bandwidth } & =\frac{N}{2} b
\end{array}
$$

## $k$-ary $n$-Cubes versus $k$-ary Trees vs Butterflies

|  | $k$-ary $n$-cubes | binary tree | butterfly |
| :--- | :---: | :---: | :---: |
| cost | $O(N)$ | $O(N)$ | $O(N \log N)$ |
| distance | $\frac{1}{2} \sqrt[d]{N} \log N$ | $2 \log N$ | $\log N$ |
| links per node | 2 | 2 | $\log N$ |
| bisection | $2 N^{\frac{d-1}{d}}$ | 1 | $\frac{1}{2} N$ |
| frequency limit of <br> random traffic | $1 /\left(\sqrt[d]{\frac{N}{2}}\right)$ | $1 / N$ | $1 / 2$ |

## Problems with Butterflies

- Cost of the network
* $O(N \log N)$
* 2-d layout is more difficult than for binary trees
* Number of long wires grows faster than for trees.
- For each source-destination pair there is only one route.
- Each route blocks many other routes.


## Benes Networks



- Many routes;
- Costly to compute non-blocking routes;
- High probability for non-blocking route by randomly selecting an intermediate node [Leighton, 1992];


## Fat Trees



16-node 2-ary fat-tree

## $k$-ary $n$-dimensional Fat Tree Characteristics



16-node 2 -ary fat-tree

| number of nodes $N$ | $=k^{d}$ |
| :--- | :--- |
| number of switches | $=k^{d-1} d$ |
| switch degree | $=2 k$ |
| diameter | $=2 d$ |
| distance | $\sim d$ |
| network cost | $=O(N d)$ |
| total bandwidth | $=2 k^{d} d b$ |
| bisection bandwidth | $=2 k^{d-1} b$ |

## $k$-ary $n$-Cubes versus $k$-ary $d$-dimensional Fat Trees

$k$-ary $n$-cubes:

| number of nodes $N$ | $=k^{d}$ |
| :--- | :--- |
| switch degree | $=d$ |
| diameter | $=d(k-1)$ |
| distance | $\sim d \frac{1}{2}(k-1)$ |
| network cost | $=O(N)$ |
| total bandwidth | $=2 N b$ |
| bisection bandwidth | $=2 k^{(d-1)} b$ |

$k$-ary $n$-dimensional fat trees:

| number of nodes $N$ | $=k^{d}$ |
| :--- | :--- |
| number of switches | $=k^{d-1} d$ |
| switch degree | $=2 k$ |
| diameter | $=2 d$ |
| distance | $\sim d$ |
| network cost | $=O(N d)$ |
| total bandwidth | $=2 k^{d} d b$ |
| bisection bandwidth | $=2 k^{d-1} b$ |

number of switches $=k^{d-1} d$
diameter $=2 d$
distance $\sim d$
network cost $\quad=O(N d)$
total bandwidth $=2 k^{d} d b$
bisection bandwidth $=2 k^{d-1} b$

## Relation between Fat Tree and Hypercube


binary 2-dim fat tree

binary 1-cube

## Relation between Fat Tree and Hypercube - cont'd


binary 3 -dim fat tree

binary 2 -cube

binary 2 -cube

## Relation between Fat Tree and Hypercube - cont'd


binary 4-dim fat tree

binary 3-cube

binary 3-cube

## Trade-offs in Topology Design for the $k$-ary $n$-Cube

- Unloaded Latency
- Latency under Load


## Network Scaling for Unloaded Latency

Latency $(n)=$ Admission + RoutingDelay + ContentionDelay
RoutingDelay $T_{c t}(n, h)=\frac{n}{b}+h \Delta$
RoutingDistance $h=\frac{1}{2} d(k-1)=\frac{1}{2}(k-1) \log _{k} N=\frac{1}{2}(d \sqrt[d]{N}-1)$



## Unloaded Latency for Small Networks and Local Traffic





## Unloaded Latency under a Free-Wire Cost Model

Free-wire cost model: Wires are free and can be added without penalty.



## Unloaded Latency under a Fixed-Wire Cost Models

Fixed-wire cost model: The number of wires is constant per node:
128 wires per node: $w(d)=\left\lfloor\frac{64}{d}\right\rfloor$.

| $d$ | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| $w(d)$ | 32 | 21 | 16 | 12 | 10 | 9 | 8 | 7 | 6 |



## Unloaded Latency under a Fixed-Bisection Cost Models

Fixed-bisection cost model: The number of wires across the bisection is constant:
bisection $=1024$ wires: $w(d)=\frac{k}{2}=\frac{\sqrt[d]{N}}{2}$.
Example: $\mathrm{N}=1024$ :

| $d$ | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| $w(d)$ | 512 | 16 | 5 | 3 | 2 | 2 | 1 | 1 | 1 |




## Unloaded Latency under a Logarithmic Wire Delay Cost Models

Fixed-bisection Logarithmic Wire Delay cost model: The number of wires across the bisection is constant and the delay on wires increases logarithmically with the length [Dally, 1990]: Length of long wires: $l=k^{\frac{n}{2}-1}$

$$
T_{c} \propto 1+\log l=1+\left(\frac{d}{2}-1\right) \log k
$$



## Unloaded Latency under a Linear Wire Delay Cost Models

Fixed-bisection Linear Wire Delay cost model: The number of wires across the bisection is constant and the delay on wires increases linearly with the length [Dally, 1990]:
Length of long wires: $l=k^{\frac{n}{2}-1}$

$$
T_{c} \propto l=k^{\frac{d}{2}-1}
$$




## Latency under Load

## Assumptions [Agarwal, 1991]:

- $k$-ary $n$-cubes
- random traffic
- dimension-order cut-through routing
- unbounded internal buffers (to ignore flow control and deadlock issues)


## Latency under Load - cont'd

Latency $(n)=$ Admission + RoutingDelay + ContentionDelay

$$
\begin{aligned}
T(m, k, d, w, \rho) & =\text { RoutingDelay }+ \text { ContentionDelay } \\
T(m, k, d, w, \rho) & =\frac{m}{w}+d h_{k}(\Delta+W(m, k, d, w, \rho)) \\
W(m, k, d, w, \rho) & =\frac{m}{w} \cdot \frac{\rho}{(1-\rho)} \cdot \frac{h_{k}-1}{h_{k}^{2}} \cdot\left(1+\frac{1}{d}\right) \\
h & =\frac{1}{2} d(k-1)
\end{aligned}
$$

m ... message size
$w$... bitwidth of link
$\rho \cdots$ aggregate channel utilization
$h_{k} \cdots$ average distance in each dimension
$\Delta \ldots$ switching time in cycles

## Latency vs Channel Load



## Routing

Deterministic routing The route is determined solely by source and destination locations.

Arithmetic routing The destination address of the incoming packet is compared with the address of the switch and the packet is routed accordingly. (relative or absolute addresses)

Source based routing The source determines the route and builds a header with one directive for each switch. The switches strip off the top directive.

Table-driven routing Switches have routing tables, which can be configured.

Adaptive routing The route can be adapted by the switches to balance the load.

Minimal routing allows only shortest paths while non-minimal routing allows even longer paths.

## Quality of Service

- Best Effort (BE)
* Optimization of the average case
* Loose or non-existent worst case bounds
* Cost effective use of resources
- Guaranteed Service (GS)
* Maximum delay
* Minimum bandwidth
* Maximum Jitter
$\star$ Requires additional resources


## Regulated Flows

A Flow $F$ is $(\sigma, \rho)$ regulated if

$$
F(b)-F(a) \leq \sigma+\rho(b-a)
$$

for all time intervals $[a, b], 0 \leq a \leq b$ and where $F(t) \cdots$ the cumulative amount of traffic between 0 and $t \geq 0$. $\sigma \geq 0$ is the burstiness constraint; $\rho \geq 0$ is the maximum average rate;

## Regulated Flows

A Flow $F$ is $(\sigma, \rho)$ regulated if

$$
F(b)-F(a) \leq \sigma+\rho(b-a)
$$

for all time intervals $[a, b], 0 \leq a \leq b$ and where $F(t) \cdots$ the cumulative amount of traffic between 0 and $t \geq 0$. $\sigma \geq 0$ is the burstiness constraint; $\rho \geq 0$ is the maximum average rate;


## Regulated Flows - Delay Element



## Regulated Flows - Delay Element



$$
\begin{gathered}
F_{1} \sim(\sigma, \rho) \\
F_{2} \sim(\sigma+\rho D, \rho)
\end{gathered}
$$

## Regulated Flows - Work Conserving Multiplexer



## Regulated Flows - Work Conserving Multiplexer



$$
\begin{aligned}
F_{1} & \sim\left(\sigma_{1}, \rho_{1}\right) \\
F_{2} & \sim\left(\sigma_{2}, \rho_{2}\right) \\
\text { link bandwidth } b & <\rho_{1}+\rho_{2} \\
F_{3} & \sim ? \\
\text { maximum delay } D & =? \\
\text { maximum backlog } B & =?
\end{aligned}
$$

## Work Conserving Multiplexer - 1



## Work Conserving Multiplexer - 1



Phase $1\left(t_{1}\right): F_{1}$ and $F_{2}$ transmit at full speed;

## Work Conserving Multiplexer - 1



Phase $1\left(t_{1}\right): F_{1}$ and $F_{2}$ transmit at full speed;
Assume: At $t=0$ the queue is empty; $\sigma_{1} \leq \sigma_{2}$

## Work Conserving Multiplexer - 1



Phase $1\left(t_{1}\right): F_{1}$ and $F_{2}$ transmit at full speed; Assume: At $t=0$ the queue is empty; $\sigma_{1} \leq \sigma_{2}$ Injection rate: $2 b$; Drain rate: $b$

## Work Conserving Multiplexer - 1



Phase $1\left(t_{1}\right): F_{1}$ and $F_{2}$ transmit at full speed; Assume: At $t=0$ the queue is empty; $\sigma_{1} \leq \sigma_{2}$ Injection rate: $2 b$; Drain rate: $b$

$$
\begin{aligned}
b t_{1} & =\sigma_{1}+\rho_{1} t_{1} \\
t_{1} & =\frac{\sigma_{1}}{b-\rho_{1}}
\end{aligned}
$$

## Work Conserving Multiplexer - 2



## Work Conserving Multiplexer - 2



Phase $2\left(t_{2}\right): F_{1}$ transmits at rate $\rho_{1}, F_{2}$ transmits at full speed;

## Work Conserving Multiplexer - 2



Phase $2\left(t_{2}\right): F_{1}$ transmits at rate $\rho_{1}, F_{2}$ transmits at full speed; Injection rate: $b+\rho_{1}$; Drain rate: $b$

## Work Conserving Multiplexer - 2



Phase $2\left(t_{2}\right): F_{1}$ transmits at rate $\rho_{1}, F_{2}$ transmits at full speed; Injection rate: $b+\rho_{1}$; Drain rate: $b$

$$
\begin{aligned}
b t_{\mathrm{accu}} & =\sigma_{2}+\rho_{2} t_{\mathrm{accu}} \\
t_{\mathrm{accu}} & =\frac{\sigma_{2}}{b-\rho_{2}}
\end{aligned}
$$

## Work Conserving Multiplexer - 3



## Work Conserving Multiplexer - 3



Phase $3\left(t_{\text {drain }}\right): F_{1}$ transmits at rate $\rho_{1}, F_{2}$ transmits at rate $\rho_{2}$;

## Work Conserving Multiplexer - 3



Phase $3\left(t_{\text {drain }}\right): F_{1}$ transmits at rate $\rho_{1}, F_{2}$ transmits at rate $\rho_{2}$; Injection rate: $\rho_{1}+\rho_{2}$; Drain rate: $b$

## Work Conserving Multiplexer - 3



Phase $3\left(t_{\text {drain }}\right): F_{1}$ transmits at rate $\rho_{1}, F_{2}$ transmits at rate $\rho_{2}$; Injection rate: $\rho_{1}+\rho_{2}$; Drain rate: $b$

$$
t_{\text {drain }}=\frac{B_{\max }}{b-\rho_{1}-\rho_{2}}
$$

## Work Conserving Multiplexer - 3



Phase $3\left(t_{\text {drain }}\right): F_{1}$ transmits at rate $\rho_{1}, F_{2}$ transmits at rate $\rho_{2}$; Injection rate: $\rho_{1}+\rho_{2}$; Drain rate: $b$

$$
\begin{aligned}
t_{\text {drain }} & =\frac{B_{\max }}{b-\rho_{1}-\rho_{2}} \\
B_{\max } & =b t_{1}+\rho_{1} t_{2}
\end{aligned}
$$

## Work Conserving Multiplexer - 3



Phase $3\left(t_{\text {drain }}\right): F_{1}$ transmits at rate $\rho_{1}, F_{2}$ transmits at rate $\rho_{2}$; Injection rate: $\rho_{1}+\rho_{2}$; Drain rate: $b$

$$
\begin{aligned}
t_{\text {drain }} & =\frac{B_{\max }}{b-\rho_{1}-\rho_{2}} \\
B_{\max } & =b t_{1}+\rho_{1} t_{2}=\sigma_{1}+\frac{\rho_{1} \sigma_{2}}{b-\rho_{2}}
\end{aligned}
$$

## Work Conserving Multiplexer - Summary



## Work Conserving Multiplexer - Summary



$$
B_{\max }=\sigma_{1}+\frac{\rho_{1} \sigma_{2}}{b-\rho_{2}}
$$

## Work Conserving Multiplexer - Summary



$$
\begin{aligned}
B_{\max } & =\sigma_{1}+\frac{\rho_{1} \sigma_{2}}{b-\rho_{2}} \\
D_{\max } & =t_{\mathrm{accu}}+t_{\mathrm{drain}}=\frac{\sigma_{1}+\sigma_{2}}{b-\rho_{1}-\rho_{2}}
\end{aligned}
$$

## Work Conserving Multiplexer - Summary



$$
\begin{aligned}
B_{\max } & =\sigma_{1}+\frac{\rho_{1} \sigma_{2}}{b-\rho_{2}} \\
D_{\max } & =t_{\text {accu }}+t_{\text {drain }}=\frac{\sigma_{1}+\sigma_{2}}{b-\rho_{1}-\rho_{2}} \\
F_{3} & \sim\left(\sigma_{1}+\sigma_{2}, \rho_{1}+\rho_{2}\right)
\end{aligned}
$$

## MPEG Encoding Case Study



## MPEG Encoding Case Study



## MPEG Encoding Case Study



## MPEG Encoding Case Study



## MPEG Encoding Case Study



## MPEG Encoding Case Study - cont'd



$$
F_{1} \sim\left(0, \rho_{t}\right)
$$

## MPEG Encoding Case Study - cont'd



## MPEG Encoding Case Study - cont'd



## MPEG Encoding Case Study - cont'd



## MPEG Encoding Case Study - Memory



$$
M^{\prime}:\left(2 \rho_{t}, D_{M^{\prime}}\right)
$$

## MPEG Encoding Case Study - Memory



$$
\begin{aligned}
M^{\prime}: & \left(2 \rho_{t}, D_{M^{\prime}}\right) \\
& \text { For a general multiplexer we have: } \\
D_{\text {mux }}= & \frac{\sigma_{1}+\sigma_{2}}{C_{\text {out }}-\rho_{1}-\rho_{2}} \\
F_{\text {muxout }} \sim & \left(\sigma_{1}+\sigma_{2}, \rho_{1}+\rho_{2}\right)
\end{aligned}
$$

## MPEG Encoding Case Study - Memory



$$
\begin{aligned}
M^{\prime}: & \left(2 \rho_{t}, D_{M^{\prime}}\right) \\
& \text { For a general multiplexer we have: } \\
D_{\operatorname{mux}}= & \frac{\sigma_{1}+\sigma_{2}}{C_{\text {out }}-\rho_{1}-\rho_{2}} \\
F_{\text {muxout }} \sim & \left(\sigma_{1}+\sigma_{2}, \rho_{1}+\rho_{2}\right) \\
F_{M 3} \sim & \left(\rho_{t}\left(D_{1}+D_{\operatorname{mux}}+D_{M^{\prime}}\right), \rho_{t}\right)
\end{aligned}
$$

## MPEG Encoding Case Study - Memory



$$
\begin{aligned}
M^{\prime}: & \left(2 \rho_{t}, D_{M^{\prime}}\right) \\
& \text { For a general multiplexer we have: } \\
D_{\text {mux }}= & \frac{\sigma_{1}+\sigma_{2}}{C_{\text {out }}-\rho_{1}-\rho_{2}} \\
F_{\text {muxout }} \sim & \left(\sigma_{1}+\sigma_{2}, \rho_{1}+\rho_{2}\right) \\
F_{M 3} \sim & \left(\rho_{t}\left(D_{1}+D_{\operatorname{mux}}+D_{M^{\prime}}\right), \rho_{t}\right) \\
F_{M 4} \sim & ?
\end{aligned}
$$

## MPEG Encoding Case Study - cont'd



## MPEG Encoding Case Study - cont'd



## MPEG Encoding Case Study - cont'd



## MPEG Encoding Case Study - cont'd



## MPEG Encoding Case Study - Memory



## MPEG Encoding Case Study - Memory



$$
\begin{aligned}
M^{\prime} & :\left(2 \rho_{t}, D_{M^{\prime}}\right) \\
D_{\mathrm{mux}} & =\frac{\sigma_{1}+\sigma_{2}}{C_{\text {out }}-\rho_{1}-\rho_{2}}=\frac{S_{\text {buffer }}+\rho_{t}\left(D_{1}+D_{3}\right)}{C_{\text {out }}-2 \rho_{t}} \\
F_{M 1} & \sim\left(S_{\text {buffer }}+\rho_{t}\left(D_{1}+D_{3}\right), 2 \rho_{t}\right) \\
F_{M 2} & \sim\left(S_{\text {buffer }}+\rho_{t}\left(D_{1}+D_{3}+2 D_{M^{\prime}}\right), 2 \rho_{t}\right) \\
F_{M 3} & \sim\left(\rho_{t}\left(D_{1}+D_{m u x}+D_{M^{\prime}}\right), \rho_{t}\right) \\
F_{M 4} & \sim\left(S_{\text {buffer }}+\rho_{t}\left(D_{3}+D_{m u x}+D_{M^{\prime}}\right), \rho_{t}\right)
\end{aligned}
$$

## MPEG Encoding Case Study - cont'd



## MPEG Encoding Case Study - cont'd

Backlog of the regulators:


$$
\begin{aligned}
B_{R M 1}= & \max \left(0, \rho_{t}\left(D_{1}+D_{\operatorname{mux}}+D_{M^{\prime}}\right)\right. \\
& \left.-S_{\mathrm{buffer}}\right) \\
B_{R M 2}= & \max \left(0,128 B+\rho_{t}\left(D_{3}+D_{\operatorname{mux}}\right.\right. \\
& \left.\left.+D_{M^{\prime}}\right)-S_{\mathrm{buffer}}\right)
\end{aligned}
$$

Delay of the regulators:

$$
\begin{aligned}
D_{R M 1} & =\frac{B_{R M 1}}{\rho_{t}} \\
D_{R M 2} & =\frac{B_{R M 2}}{\rho_{t}}
\end{aligned}
$$

## MPEG Encoding Case Study - cont'd



## MPEG Encoding Case Study - cont'd



The flow from the memory to S :

$$
\begin{array}{cc}
F_{3} & \sim\left(S_{\text {buffer }}, \rho_{t}\right) \\
C_{2} & :\left(\rho_{t}, D_{2}\right) \\
F_{4} & \sim\left(S_{\text {buffer }}+\rho D_{2}, \rho_{t}\right),
\end{array}
$$

A charatcerization of $S$ and its output:

$$
\begin{aligned}
S & :\left(\rho_{t}, \frac{S_{\text {buffer }}}{\rho_{t}}\right) \\
F_{5} & \sim\left(2 S_{\text {buffer }}+\rho_{t} D_{2}, \rho_{t}\right)
\end{aligned}
$$

The flows between memory and V :

$$
\begin{array}{cc}
F_{8} & \sim\left(S_{\text {buffer }}, \rho_{t}\right) \\
C_{4} & :\left(\rho, D_{4}\right) \\
F_{9} & \sim\left(S_{\text {buffer }}+\rho_{t} D_{4}, \rho_{t}\right)
\end{array}
$$

## MPEG Encoding Case Study - cont'd



## MPEG Encoding Case Study - cont'd



End to end delay:

$$
\begin{aligned}
D_{\text {total }}= & D_{1}+D_{\operatorname{mux}}+D_{M^{\prime}} \\
& +D_{R M 1}+D_{2}+D_{S}+D_{R S}+D_{3} \\
& +D_{\operatorname{mux}}+D_{M^{\prime}}+D_{R M 2}+D_{4}
\end{aligned}
$$

The flow at V :

$$
F_{T \rightarrow V} \sim\left(0+\rho_{t} D_{t o t a l}, \rho_{t}\right)
$$

## Modeling with Regulated Flows

- Interconnect:
* Model each channel by available bandwidth and maximum delay variation;
* Model each node in the interconnect as an arbiter;
- Model read request, write acknowledge as separate flows;
- Model synchronization as separate flows;
- A simple generalization of $(\sigma, \rho)$ flows is

$$
\begin{gathered}
F \sim \min \left(\sigma_{i}, \rho_{i}\right), i>0 \\
F(b)-F(a) \leq \min _{i}\left(\sigma_{i}+\rho_{i}(b-a)\right)
\end{gathered}
$$

- Good analysis depends on good element models;


## Network Calculus - Arrival Curves



Given a monotonically increasing function $\alpha$, defined for $t \geq 0$, $\alpha$ is an arrival curve for flow $F$ if for all $0 \leq a \leq b$ :

$$
F(b)-F(a) \leq \alpha(b-a)
$$

## Network Calculus - Min-Plus Convolusion

Given two monotonically increasing functions $f$ and $g$. The min-plus convolusion of $f$ and $g$ is the function

$$
(f \otimes g)(t)=\inf _{0 \leq s \leq t}(f(t-s)+g(s))
$$

## Network Calculus - Min-Plus Convolusion

Given two monotonically increasing functions $f$ and $g$. The min-plus convolusion of $f$ and $g$ is the function

$$
(f \otimes g)(t)=\inf _{0 \leq s \leq t}(f(t-s)+g(s))
$$

If $\alpha$ is an arrival curve for $F$ we have:

$$
F \leq F \otimes \alpha
$$

## Network Calculus - Min-Plus Convolusion

Given two monotonically increasing functions $f$ and $g$. The min-plus convolusion of $f$ and $g$ is the function

$$
(f \otimes g)(t)=\inf _{0 \leq s \leq t}(f(t-s)+g(s))
$$

If $\alpha$ is an arrival curve for $F$ we have:

$$
F \leq F \otimes \alpha
$$

and

$$
F \leq \alpha \otimes \alpha
$$

with $\alpha \otimes \alpha$ being the best bound that we can find based on information of $\alpha$.

## Network Calculus - Service Curves



Given a system $S$ with an input flow $F$ and an output flow $F^{*}$. S offers the flow a service curve $\beta$ if and only if $\beta$ is a monotonically increasing function and $F^{*} \geq F \otimes \beta$ which means that

$$
F^{*}(t) \geq \inf _{s \leq t}(F(t)+\beta(t-s))
$$

## Network Calculus - Backlog Bound



Given a flow $F$ constrained by arrival curve $\alpha$ and a system offering a service curve $\beta$, the backlog $F(t)-F^{*}(t)$ for all $t$ satisfies

$$
F(t)-F^{*}(t) \leq \sup _{s \geq 0}(\alpha(s)-\beta(s))
$$



Given a flow $F$ constrained by arrival curve $\alpha$ and a system offering a service curve $\beta$, the delay $d(t)$ at time $t$ is

$$
d(t)=\inf \left(\tau \geq 0: F(t) \leq F^{*}(t+\tau)\right)
$$

It satisfies

$$
d(t) \leq h(\alpha, \beta)=\sup _{t \geq 0}(\inf (\tau \geq 0: \alpha(t) \leq \beta(t+\tau)))
$$

## Network Calculus - Output Arrival Curye



Given a flow $F$ constrained by arrival curve $\alpha$ and a system offering a service curve $\beta$, the output flow $F^{*}$ is constrained by the arrival curve $\alpha^{*}$

$$
\begin{gathered}
\alpha^{*}=\alpha \oslash \beta \\
(\alpha \oslash \beta)(t)=\sup _{s \geq 0}(\alpha(t+s)-\beta(s))
\end{gathered}
$$

## Network Calculus - Useful Functions





Peak rate function:
$\lambda_{R}(t)=R t$

Rate latency function:
$\beta_{R, T}(t)=R[t-T]^{+}$

Affine function:

$$
\gamma_{\sigma, \rho}(t)= \begin{cases}0 & \text { for } t=0 \\ \sigma+\rho t & \text { for } t>0\end{cases}
$$



Burst-delay function:

$$
\delta_{T}(t)= \begin{cases}0 & \text { for } t \leq T \\ \infty & \text { for } t>T\end{cases}
$$



Staircase function:

$$
v_{T, \tau}(t)=\lceil(t+\tau) / T\rceil
$$



Step function:

$$
u_{T}(t)= \begin{cases}0 & \text { for } t \leq T \\ 1 & \text { for } t>T\end{cases}
$$

## Network Calculus - Concatenation of Nodes



## Network Calculus - Concatenation of Nodes



Example:

$$
\begin{aligned}
\beta_{1} & =\beta_{R_{1}, T_{1}} \\
\beta_{2} & =\beta_{R_{2}, T_{2}} \\
\beta_{R_{1}, T_{1}} \otimes \beta_{R_{2}, T_{2}} & =\beta_{\min \left(R_{1}, R_{2}\right), T_{1}+T_{2}}
\end{aligned}
$$

## Network Calculus - Concatenation of Nodes



Example:

$$
\begin{aligned}
\beta_{1} & =\beta_{R_{1}, T_{1}} \\
\beta_{2} & =\beta_{R_{2}, T_{2}} \\
\beta_{R_{1}, T_{1}} \otimes \beta_{R_{2}, T_{2}} & =\beta_{\min \left(R_{1}, R_{2}\right), T_{1}+T_{2}}
\end{aligned}
$$

Useful properties: $\quad f \otimes g=g \otimes f$

$$
(f \otimes g) \otimes h=f \otimes(g \otimes h)
$$

$$
(f+c) \otimes g=(f \otimes g)+c \text { for any constant } c \in \mathbb{R}
$$

## Network Calculus - Pay Bursts Only Once




## Network Calculus - Pay Bursts Only Once




$$
\begin{aligned}
\alpha & =\gamma_{\rho, \sigma} \\
\beta_{1} & =\beta_{R_{1}, T_{1}}=R_{1} \max \left(0, t-T_{1}\right) \\
\beta_{2} & =\beta_{R_{2}, T_{2}}=R_{2} \max \left(0, t-T_{2}\right)
\end{aligned}
$$

## Network Calculus - Pay Bursts Only Once




$$
\begin{aligned}
\alpha & =\gamma_{\rho, \sigma} \\
\beta_{1} & =\beta_{R_{1}, T_{1}}=R_{1} \max \left(0, t-T_{1}\right) \\
\beta_{2} & =\beta_{R_{2}, T_{2}}=R_{2} \max \left(0, t-T_{2}\right) \\
\beta_{R_{1}, T_{1}} \otimes \beta_{R_{2}, T_{2}} & =\beta_{\min \left(R_{1}, R_{2}\right), T_{1}+T_{2}}=\min \left(R_{1}, R_{2}\right) \max \left(0, t-\left(T_{1}+T_{2}\right)\right)
\end{aligned}
$$

## Network Calculus - Pay Bursts Only Once




$$
\begin{aligned}
\alpha & =\gamma_{\rho, \sigma} \\
\beta_{1} & =\beta_{R_{1}, T_{1}}=R_{1} \max \left(0, t-T_{1}\right) \\
\beta_{2} & =\beta_{R_{2}, T_{2}}=R_{2} \max \left(0, t-T_{2}\right) \\
\otimes \beta_{R_{2}, T_{2}} & =\beta_{\min \left(R_{1}, R_{2}\right), T_{1}+T_{2}}=\min \left(R_{1}\right. \\
D_{1}+D_{2} & =\frac{\sigma}{R_{1}}+\frac{\sigma}{R_{2}}+\frac{\rho T_{1}}{R_{2}}+T_{1}+T_{2}
\end{aligned}
$$

$$
\beta_{R_{1}, T_{1}} \otimes \beta_{R_{2}, T_{2}}=\beta_{\min \left(R_{1}, R_{2}\right), T_{1}+T_{2}}=\min \left(R_{1}, R_{2}\right) \max \left(0, t-\left(T_{1}+T_{2}\right)\right)
$$

## Network Calculus - Pay Bursts Only Once




$$
\begin{aligned}
\alpha & =\gamma_{\rho, \sigma} \\
\beta_{1} & =\beta_{R_{1}, T_{1}}=R_{1} \max \left(0, t-T_{1}\right) \\
\beta_{2} & =\beta_{R_{2}, T_{2}}=R_{2} \max \left(0, t-T_{2}\right) \\
\otimes \beta_{R_{2}, T_{2}} & =\beta_{\min \left(R_{1}, R_{2}\right), T_{1}+T_{2}}=\min \left(R_{1}\right. \\
D_{1}+D_{2} & =\frac{\sigma}{R_{1}}+\frac{\sigma}{R_{2}}+\frac{\rho T_{1}}{R_{2}}+T_{1}+T_{2} \\
D_{S} & =\frac{\sigma}{\min \left(R_{1}, R_{2}\right)}+T_{1}+T_{2}
\end{aligned}
$$

$$
\beta_{R_{1}, T_{1}} \otimes \beta_{R_{2}, T_{2}}=\beta_{\min \left(R_{1}, R_{2}\right), T_{1}+T_{2}}=\min \left(R_{1}, R_{2}\right) \max \left(0, t-\left(T_{1}+T_{2}\right)\right)
$$

## Summary

- Communication Performance: bandwidth, unloaded latency, loaded latency
- Organizational Structure: NI, switch, link
- Topologies: wire space and delay domination favors low dimension topologies;
- Routing: deterministic vs source based vs adaptive routing; deadlock;
- Quality of Service and flow regulation


## To Probe Further

Classic papers:
[Agarwal, 1991] Agarwal, A. (1991). Limit on interconnection performance. IEEE Transactions on Parallel and Distributed Systems, 4(6):613-624.
[Dally, 1990] Dally, W. J. (1990). Performance analysis of k-ary n-cube interconnection networks. IEEE Transactions on Computers, 39(6):775-785.

## Text books:

[Duato et al., 1998] Duato, J., Yalamanchili, S., and Ni, L. (1998). Interconnection Networks - An Engineering Approach. Computer Society Press, Los Alamitos, California.
[Culler et al., 1999] Culler, D. E., Singh, J. P., and Gupta, A. (1999). Parallel Computer Architecture - A Hardware/Software Approach. Morgan Kaufman Publishers.
[Dally and Towels, 2004] Dally, W. J. and Towels, B. (2004). Principles and Practices of Interconnection Networks. Morgan Kaufman Publishers.
[DeMicheli and Benini, 2006] DeMicheli, G. and Benini, L. (2006). Networks on Chip. Morgan Kaufmann.
[Leighton, 1992] Leighton, F. T. (1992). Introduction to Parallel Algorithms and Architectures. Morgan Kaufmann, San Francisco.
[LeBoudec, 200] Jean-Yves LeBoudec, J-Y. (2001). Network Calculus. Springer Verlag, LCNS 2050

