Dear Author,

Please, note that changes made to the HTML content will be added to the article before publication, but are not reflected in this PDF.

Note also that this file should not be used for submitting corrections.

Journal of Systems Architecture xxx (2015) xxx-xxx

Contents lists available at ScienceDirect

# Journal of Systems Architecture

journal homepage: www.elsevier.com/locate/sysarc

# MultiCS: Circuit switched NoC with multiple sub-networks and sub-channels

Shaoteng Liu<sup>a,\*</sup>, Axel Jantsch<sup>b</sup>, Zhonghai Lu<sup>a</sup>

<sup>a</sup> KTH Royal Institute of Technology, Sweden

9 <sup>b</sup> TU Wien, Vienna, Austria

#### ARTICLE INFO

 13
 Article history:

 14
 Received 26 August 2014

 15
 Received in revised form 9 June 2015

 16
 Accepted 27 July 2015

 17
 Available online xxxx

Keywords:
 NoC
 Circuit switching

21 Multi-channel

22 SDM

23

40

57

58

59

5 6

10

#### ABSTRACT

We propose a multi-channel and multi-network circuit switched NoC (MultiCS) with a probe searching setup method to explore different channel partitioning and configuration policies. Our design has a variable number of channels which can be configured either as sub-channels (spatial division multiplexing channels) or sub-networks. Packets can be delivered on an established connection with one or multiple channels. An adaptive channel allocation scheme, which determines a connection width according to the dynamic use of channels, can greatly reduce the delay, compared to a deterministic allocation scheme. However, the latter can offer exact connection width as requested. The benefits and burden of using different number of channels and configurations are studied by analysis and experiments. Our experimental results show that sub-network configurations are superior to sub-channel configurations in delay and throughput, when working at the highest clock frequency of each configuration. Under reasonable channel enetwork using single wide channels.

© 2015 Elsevier B.V. All rights reserved.

## 35 36 37 38 39

60

61

62

63

64

65

66

67

68

69

70

71

72

73

74

75 76

77

78

79

80

81

25

26

27

28

29

30

31

32

33

34

## 41 1. Introduction

Compared to packet switched (PS) NoCs with TDM channels,
 circuit switched (CS) NoCs are preferable under certain requirements. For example, the advantages of using CS NoC with spatial
 division multiplexing (SDM) channels on streaming applications
 like 3D graphics have been demonstrated in [1].

With the increasing number of wires available on-chip, the number of possibilities to organize, use and allocate them grows combinatorially. We have more freedom to choose the number of channels and allocate wires for them, instead of giving all wires to only one channel. Thus, how to allocate wire resources and organize multiple channels becomes an interesting research question.

We propose a CS network with multiple channels and multiple
 networks (MultiCS) to explore the effects of different channel par titioning and configuration polices. We offer the following
 contributions:

• The proposed CS NoC combines spatial division multiplexing [1], which we call sub-channels, with sub-networks. Sub-channels divide the wires between two switches which

E-mail addresses: liu2@kth.se (S. Liu), axel.jantsch@tuwien.ac.at (A. Jantsch), zhonghai@kth.se (Z. Lu).

http://dx.doi.org/10.1016/j.sysarc.2015.07.013 1383-7621/© 2015 Elsevier B.V. All rights reserved. then can be allocated separately and independently. Sub-networks are independent networks that connect to the same nodes. A connection between two nodes can utilize one or several sub-channels and one or several sub-networks (Section 3).

- We extend the parallel probing setup method described in [2] to allow a single communication to utilize several sub-networks or sub-channels to meet its bandwidth requirements. This setup method uses a minimal number of extra wires. If a connection, possibly spanning across multiple sub-channels and sub-networks, with sufficient bandwidth is available, it will be found in 3 \* D + 4 cycles, where *D* is the distance between source and destination (Section 3).
- We propose two schemes for building connections with multiple channels. One is called deterministic channel allocation (DCA), which imposes an exact width requirement on a connection. The other is called adaptive channel allocation (ACA), with which the width of a connection is allocated according to run-time channel usage of the network. ACA avoids the channel fragments and superfluous connections problems associated with DCA. However, it guarantees only a minimum connection width (Section 4).





<sup>\*</sup> Corresponding author. Tel.: +46 722935320.

2

86

87

88 89

90 91

92

93

94 95

96

97

98

99

S. Liu et al./Journal of Systems Architecture xxx (2015) xxx-xxx

- We discuss the implementation cost of MultiCS. We also develop an analytical performance model to explain how the maximum throughput is related to packet size and the number of channels (Section 5).
  - We evaluate five configurations of MultiCS under the two schemes. The five configurations vary the number of sub-networks and sub-channels. The experiment results comply with our previous model and analysis (Section 6).

#### 2. Motivation

CS NoC has a fundamental limitation. When a channel between two switches is allocated to one connection, it cannot be used by any other connection. This inherent inflexibility limits the usefulness of CS. A solution is to partition the channel into multiple narrow channels and allocate only one or a few narrow channels to a given connection. An obvious question is, what would be the trade-off of number of narrow channels: as many as possible, or is there any limitation?

100 The second question is: what are pros and cons of different ways of organizing multiple channels? Generally speaking, there 101 are two methods to configure channels of a node: by 102 103 sub-channels or by sub-networks, as illustrated in Fig. 1. The term 104 sub-network refers to several separate circuit switching networks 105 working in parallel. A network interface has access to all available 106 sub-networks, but once data has entered one particular 107 sub-network, it cannot change to another sub-network. In contrast, sub-channels are parallel links between switches. Those 108 sub-channels are organized as SDM inside a switch. Data can use 109 different sub-channels for different hops in the network. 110

The third question is how to establish a connection with multiple channels. Traditionally, even if there are multiple channels, still each packet flow is delivered by building a connection with only one channel [3,4] per hop. However, multiple channels have offered us the possibility of building a wide connection by combining several channels together for data transfer. Thus, we need new schemes and guidelines for connection set-up.

In the following we answer these three questions by comparing alternatives with the same total wire resources and with the same path search and setup algorithm. We construct our platform by combining features and merits from earlier work [2,5,6], and modify them to support multiple sub-channels and multiple sub-networks. We use a mesh topology but arguable many of our main conclusions are also valid for other topologies.

#### 125 **3. Architecture of MultiCS using parallel probing setup**

Our MultiCS NoC can have any number of sub-channels, sub-networks, or a combination of them. The term "sub" denotes the number of sub-networks used, and the term "ch" denotes the number of sub-channels used. Suppose k is the number of sub-channels of a switch and m is number of sub-networks (*subm\_chk*), and given that the total number of wires of a switch is a constant.

#### 133 3.1. Overview of a switch

As shown in Fig. 2, in a mesh topology every switch has five 134 135 directions which are used for connecting to four neighbors and 136 one local resource. Each direction may have multiple duplex chan-137 nels. Each channel contains a data path, which is used for carrying the probe during the setup phase and for transmitting data when a 138 139 connection has been established. Every data path is also associated 140 with 3 bits control signals: an answer (ANS) signal consisting of 2 141 bits, which goes in the opposite direction to the data channel, and 1



Fig. 1. Splitting wide channel into sub-channels or sub-networks.



Fig. 2. Overview of a switch.

bit for a request signal, which travels in the same direction as the data channel. When the request signal is '1', a probe search is running or data transfer is active. When request signal is '0', it denotes the idle state, and an established connection will be released. The usage of the ANS signal is listed in Table 1. 143

The messages required for connection configuration are simplified by using these control signals: tear-down message is carried by the request signal, and Ack/Nack messages are delivered by ANS signal. They are free of contention.

In this switch architecture, the overhead of each channel is just the 3-bit control signals. We believed this is the minimum requirement for supporting probing based setup. Although it may be argued that the 1 bit request signal can be omitted, this would increase the logic inside a switch.

The probe format is also compact. It contains source address, destination address and the channel id of the network interface of the source (Table 2). In an  $8 \times 8$  mesh with 4 sub-channels/sub-networks, the minimum width of a probe is 14 bits. Each probe is one flit in length. Thus the width of a data path is ranged from 14 bits to any bits.

#### 3.2. Path searching algorithm

Generally speaking, our platform is not bound to a particular 163 path searching algorithm, meaning that any algorithm can be 164

161 162

147

148

149

150

151

152

153

154

155

156

157

158

159

160

3

S. Liu et al./Journal of Systems Architecture xxx (2015) xxx-xxx

Table 1

Usage of ANS signal.

| ANS | Usage                 |  |
|-----|-----------------------|--|
| 00  | Idle                  |  |
| 01  | Probe search continue |  |
| 10  | Probe failed          |  |
| 11  | Path established      |  |

165 applied. The performance comparison results of different path 166 searching algorithms are shown in Section 6.2.1.

167 The parallel probing algorithm [2] is chosen as our default path 168 searching algorithm in this paper because of its high performance. Parallel probing is an adaptive path searching algorithm. The 169 170 fundamental idea was proposed by us in [2]. We illustrate this idea in Fig. 3. Suppose node 1 want to set-up a path to the destination 171 node 9. In the beginning, a setup probe carrying information such 172 as source address, destinations address and channel id is generated 173 174 by the network interface and enters into node 1. Then, node 1 sends out two probes to the neighboring nodes 2 and 4. In the sec-175 176 ond hop each probe splits into two, and all the probes continue to 177 move towards the destination along all the possible minimum 178 paths.

179 Whenever two probes containing the same setup request meet, 180 one of them is regarded as redundant and is canceled, and all channels used only by the canceled probe are released. For example, 181 182 when node 5 receives two probes which are the same, one of them is canceled and all the channels it has booked before are released, 183 184 (we have marked the released channel with a cross marker) as 185 shown in Fig. 3b). Note that, the channel between node 1 and node 186 2 is not released, because it is still needed for the probe that has 187 traveled further to node 3.

188 The release process proceeds backwards hop by hop. The switch 189 does a release based on the registered connectivity information 190 that connects an input port to an output port. When a release signal (ANS = 10, "probe failed") arrives at an output port from a 191 downstream switch, the corresponding input port is looked up, 192 the connection is canceled, and the release signal is forwarded 193 194 upstream to the input port. Applying this mechanism, if several possible paths exist, one and only one of them can be finally 195 booked, just as desired. 196

In this way, a wavefront of probes travel through the network 197 198 and reach the destination on a minimal path. The time is exactly 2D, where D is the distance in terms of hops, and it takes 2 cycles 199 200 to traverse each hop. When a probe successfully reaches the desti-201 nation, an acknowledgment is sent back to the source node, which 202 takes 1 cycle per hop.

203 However, in our previous work [2], this algorithm is designed for circuit switched NoC with one wide link per direction. It applied 204 a complicated priority based arbitration mechanism which is not 205 206 applicable for multiple sub-channel usage. It requires a sorting 207 component when applied in a multi-sub-channel switch, which 208 is too costly to implement in hardware. Instead we use a round-robin based allocation. The set-up clock frequency is 209 210 increased from 570 MHz [2] to 1.1 GHz (using SMIC 90 nm Tech).

#### 3.3. Operation flow 211

212 Our circuit switching network has six operations which are 213 explained in Fig. 4. In stage 1, a probe carrying setup information

Table 2 Drobo format

| 6-Bit src. addr | 6-Bit dest. addr | 2-Bit channel id |
|-----------------|------------------|------------------|



(a) In each node a probe may double. (b) When two probes meet, one is cancelled.

Fig. 3. The overview of the parallel probing algorithm.



Fig. 4. Operation flow of parallel probing method.

is sent out from source node and moving towards the destination 214 according to the following algorithm. At each hop, if there are free 215 channels, the probe will book one channel and move forward 216 (stage 2). Otherwise, the probe is failed and it will use ANS signal 217 (Nack) to clear all the channels it has already been booked (stage 218 3). When the source node gets notice that all the sent out probes 219 have failed (stage 4), it will retry again. If a probe successfully 220 reaches its destination, the Ack signal will be sent back, which 221 means the connection has been established and data transfer can 222 be launched (stage 5). After data transfer finished (stage 6), con-223 nection will be torn-down. It will then go back to stage 1 to wait 224 and serve new setup request. 225

#### 3.4. Detailed switch architecture

The internal structure of a switch is shown in Fig. 5. It is divided into two parts: control path and data path. The data path transfers data through the configured data crossbar. The control path is used to set up or tear-down a data path. The control path and data path share the same input and output wires.

We designed an allocator to do channel allocation for the probes arriving simultaneously at a switch. The principle of our single cycle maximal allocator with round-robin fairness is similar to a wave-front allocator [7], but it is smaller, fairer, faster and free of combinational loops. Detailed description of this allocator is described in our work [8].

It should be noted that the area of this allocator increase with  $O(n^2)$ , the critical path length scales O(n), with n being the number of channels per direction. For example, in a 1-channel-per-direction switch, the allocator in charge of each output direction consists of

226 227

228

229

230

231

232

233

234

235

236

237

S. Liu et al./Journal of Systems Architecture xxx (2015) xxx-xxx



Fig. 5. Internal structure of a switch.

only 4 tiles, while in a 4-channel-per-direction switch, the allocatorper direction contains 64 tiles.

#### 244 3.4.1. Source synchronized data transfer

The control path and data path can work at different clock fre-245 246 quencies and share the same wires without interference. This clock 247 scheme takes advantage of the property of circuit switched NoC 248 that the setup phase never overlaps with the data transfer phase. 249 During the connection setup phase, the data path should have no 250 active clock signal, thus it is idle. And the cross-bar of the data path 251 can be configured under the control path clock (probe clock). 252 During the data transfer phase, the control path ignores data vari-253 ations on the shared wire links. It just listens to the request and ANS signals. 254

Therefore, we can utilize either source synchronous data transfer [9–11] or clock gating to realize this separation of data and control path clock schemes, so that the data transfer can benefit from a higher clock frequency. In this paper, we chose the former source synchronous data transfer. The usage of this technique on CS NoC has been justified by Pham et al. [11,6].

#### 3.4.2. Predictable delay

261

One of the benefits of the probing set-up approach is pre-262 dictable latency. In our design, each probe takes 2 clock cycles 263 264 per hop, and the ANS signal takes 1 cycle per hop. So, it takes at 265 most 3 \* D + 4 cycles for a probe to travel from source to destina-266 tion and back the ANS signal (D is the hop distance between source 267 and destination). 4 cycles is the overhead consumed in the source 268 and destination nodes. Therefore, in an n \* n mesh the worst case 269 for a single search takes 3 \* (2 \* n - 2) + 6 cycles, no matter if the result is a success or a failure. There is no such bound for the packet 270 271 configuration approach such as [12]. It is reported that it takes 76 272 cycles on average for 6 hop distances [12], while it is fixed to 22 273 cycles by using our probing approach.

For data transfer, the head flit takes two cycles per hop, and the 274 following flits are pipelined. 275

276

296

#### 3.4.3. Configurable sub-channels and sub-networks

Even though the total number of wires between switches is the 277 same in different configurations, their costs and performances are 278 different. Fig. 6 depicts configurations sub1\_ch2 and sub2\_ch1. 279 Their intra-switch connection relationship of channels is unveiled 280 by using a switch block diagram, in which a line denotes a 281 bi-directional connection between two duplex channels. For exam-282 ple, in the multi-sub-channel sub1\_ch2 case, an output channel 283 can be connected to all input channels except to the ones of the 284 same direction. Since each output channel needs to select from 8 285 input channels, which means an 8-to-1 multiplexer is required 286 by each output channel, and thus the switching logic in total has 287 ten 8-to-1 multiplexers. However, in the multi-sub-network 288 sub2\_ch1, a channel is restricted to connect to the channels of 289 the same sub-network. Thus, each output channel just needs a 290 4-to-1 multiplexer and the entire switching logic is just made up 291 of ten 4-to-1 multiplexer. As a result, for a given number of wires, 292 although sub-channel configuration offers more switching flexibil-293 ity [13] than sub-network, its switching logic is much more 294 complicated. 295

#### 4. Connection building schemes in MultiCS

Since a network interface has access to all available 297 sub-networks and sub-channels (Fig. 6), we have the flexibility to 298 build a wide connection with one or more channels per hop to deliver a packet, rather than only one-channel-per-hop. For example, 300 in a sub2\_ch2 network, each resource node is connected to 4 input-output channel pairs, so at most it can use all its 4 output 302 channels for one connection. In order to set up a connection with 303



320

321

322 323

324

325

326

327

328

329

330

331

332

333

334

335

336

337

338

339

340

341

342

343

344



Fig. 6. Switch block diagram of sub1\_ch2 and sub2\_ch1.



Fig. 7. Illustration for channel fragments.

4-channel width, the resource node has to send 4 setup probes out
through its 4 output channels. Each probe will try to set up a narrow connection of 1-channel width to the destination. After the
success of all 4 probes, a 4-channel connection is established.

We propose two schemes to explore the opportunities and challenges of building connections with multiple channels. We name the two schemes as deterministic channel allocation (DCA) and adaptive channel allocation (ACA), respectively.

*DCA*: DCA imposes mandatory requirement on the connection width. For example, if a packet has a connection width requirement of 4 bytes, it is restricted to use two 2-bytes channels per hop, or four 1-byte channels per hop to build up a connection; any allocation below or above this figure is unacceptable.

ACA: ACA scheme has no hard connection width requirement.
 During the setup phase, a setup request tries to utilize as many
 channels as possible to build a connection. However, the final



Fig. 8. Illustration for a superfluous connection.

connection width is determined by the number of successful setup probes, which depends on the run-time congestion situation of the network.

The DCA scheme is intended to provide desired and predefined throughput and flit width for data transfer. It imposes strict requirement on the setup phase. DCA scheme is useful in the circumstances when an end-to-end transfer has exact predefined throughput or flit width requirement.

The ACA scheme is designed to achieve better performance, at the expense of only minimum throughput and flit width (1-channel width) guarantee. Depending on channel use, ACA adaptively sets up a connection, of which the width of a connection can vary from 1 to k channel-width, where k is the total number of output channels per direction. As a result, at low load connections are likely to be wide and thus data transfer time can be shortened; at high load connection width tends to be narrower because of contention. Thus, more requests can be served and high throughput can be reached.

The traditional one-channel-per-connection (OCPC) scheme is a special case of DCA scheme (DCA with 1-channel width requirement). In OCPC scheme, the width of every connection is restricted to 1-channel width.

The implementation of DCA may introduce additional steps during setup. For example, if a packet in a sub4\_ch1 or sub1\_ch4



Fig. 9. Evaluation of path searching algorithm by using sub1\_ch4 in ACA-FPS (packet size 5120 bytes).

S. Liu et al./Journal of Systems Architecture xxx (2015) xxx-xxx

345 configuration has a DCA requirement of 8-byte width, and each 346 channel is 2-byte wide, the resource node has to wait until four 347 2-byte output channels of the network interface are available, 348 and then four probes are sent out simultaneously. Only if all of 349 the probes succeed, the data transfer can commence. Otherwise, the one-channel connections (superfluous connections) established 350 will be released and all four probes will be re-sent again until all 351 352 four succeed. The release of superfluous connections is adopted to avoid deadlock. 353

The implementation of ACA takes advantage of the predictable 354 355 set-up delay of MultiCS. For each setup request, a resource sends out probes through all free output channels of the network inter-356 face, and each probe tries to build up a one-channel connection. 357 358 After one probe succeeds, the sender will wait a short period 359 (<(3 \* D + 4) cycles) for all outstanding probes to return their 360 results, then combine all the established one-channel connections together to form a wide connection for transfer. 361

## **362 5. Cost and performance analysis**

## 363 5.1. Implementation cost

364

365

366

367

368

369

370

371

372

378

379

380

381

Suppose *k* is the number of sub-channels of a switch and *m* is number of sub-networks (*subm\_chk*), and given that the total number of wires of a switch is a constant.

The critical latency of the data path of a switch is chiefly decided by the crossbar latency, which scales with  $O(\log k)$ , and it is independent of *m*.

The area of a data cross-bar scales  $O(k^2)$  with sub-channels, and it is again independent of *m*. The registers inside the data path take a large part of area, but their number is independent of *k* and *m*.

To the latency of the control path, the allocator contributes O(k)latency and the cross-bar contributes  $O(\log k)$ . Combining both we see that latency scales O(k) with sub-channels. Again, the latency is independent of the number of sub-networks m. The area of control path scales O(m) with sub-networks and

The area of control path scales O(m) with sub-networks and  $O(k^2)$  with sub-channels. This is because using sub-networks causes a linear increase of certain components, e.g. FSMs. However, using sub-channels will cause certain components, e.g. allocator, have a  $k^2$  increase.

The synthesis results of a few configurations are listed in Table 3 with SMIC 90 nm library. The number of wires per-direction of each configuration is the same, i.e. 8 bytes. The total power and area per node is reported by Design Compiler and calculated at each one's maximum clock frequency.

Generally speaking, synthesis results are in accordance with our expectation. For example, sub1\_ch1 has the smallest area and power consumption, and can work at the fastest clock frequency. Sub4\_ch1 has the same frequency as sub1\_ch1, while sub1\_ch4 consumes the largest area because it has an  $O(k^2)$  increase in area, and works at the slowest clock frequency due to its O(k) latency scale factor.

# 394 5.2. An analytical performance model

We propose a model for per-node maximum throughput analysis. We assume that every node inside a network has the same behavior and the network achieves the maximum throughput when a node is always busy in requesting connection setup or transferring data. This means that there is no idle time.

Suppose  $t_0$  is the time used for data transfer when a connection has been established,  $t_1$  is the time consumed for a single search (it has a bounded value in our approach),  $\alpha$  is the failure rate  $(1 - \alpha$  is the success rate). Suppose further that the intensity (average

http://dx.doi.org/10.1016/j.sysarc.2015.07.013

number of certain events per time unit) of successful transfers of a node is  $\lambda(A)$ , the intensity of a single search is  $\lambda(B)$ . Based on the conclusions of Palm Calculus [14], we have 406 407

$$\frac{\lambda(A)}{\lambda(B)} = 1 - \alpha \tag{1}$$

The average time between two single searches is  $1/\lambda(B)$ , which is equal to

$$\frac{1}{\lambda(B)} = (1 - \alpha)(t_0 + t_1) + \alpha t_1$$
(2)
414

As there is always either a search or a data transfer going on,  $\lambda(A)t_0 + \lambda(B)t_1 = 1$ .

The average time between two successful searches is  $1/\lambda(A)$ , and, combining (1) and (2), we obtain

$$\frac{1}{\lambda(A)} = t_0 + \frac{t_1}{1 - \alpha} \tag{3}$$

According to (3), maximum normalized throughput (bandwidth utilization rate) is

$$THN = \frac{t_0}{t_0 + \frac{t_1}{1 - \alpha}} \tag{4}$$

Suppose *B* is the bandwidth of a resource node, then the maximum throughput of each resource is

$$TH = THN * B = \frac{Bt_0}{t_0 + \frac{t_1}{1 - \alpha}}$$
(5) 431

This simple model can explain the following intuitions:

- I. As the packet size increases,  $t_0$  goes up and thus *TH* goes up, i.e. CS NoC becomes more efficient with larger packets.
- II. Assume a fixed packet size of M (bytes) and a total bandwidth B of a resource node. If each node just has one channel, then the required time for data transfer is  $t_0 = M/B$ , and from (4) we obtain the normalized throughput *THN*. However, if we allocate the total bandwidth into two channels, with each one B/2 the bandwidth, then the data transfer time for a packet become  $t'_0 = 2t_0$ , and the normalized throughput becomes

$$THN' = \lambda(A)^* t'_0 = \frac{2t_0}{2t_0 + \frac{t_1}{1 - \alpha}}$$
(6)

Thus, splitting a wide channel into narrow channels increases the maximum throughput. This conclusion is less expected but can intuitively be explained by the fact, that if the channel bandwidth is smaller, the penalty of not using this bandwidth during setup is also smaller.

This model is a simplification. For example, in reality,  $\alpha$  in formula (4) is not a constant. It depends on a number of factors such as  $t_0$ , topology, sub-network/sub-channel configuration, connection build-up schemes and so forth. Although the model is fairly simple we will see in the subsequent simulation results, that the main conclusions are confirmed.

## 5.3. Analysis of connection building schemes

Please cite this article in press as: S. Liu et al., MultiCS: Circuit switched NoC with multiple sub-networks and sub-channels, J. Syst. Architect. (2015),

The behavior of ACA in general follows our previous intuitions, as we will see in Section 6, e.g. Fig. 10.

For DCA, however, contrary to our intuition II, under certain circumstances, the throughput of multiple narrower channels is inferior to a single wide channel. Two phenomena degrade the performance of DCA. 437

410

411

412

415

416

417

418

419

421

422

423

424

427

428

429

432

433

438 439 440

441 442

443

445

446

447

448

449 450 451

452 453 454

455 456

458 459 460

461

462

463

S. Liu et al./Journal of Systems Architecture xxx (2015) xxx-xxx



Fig. 10. Influences of packet size on maximum throughputs.

#### 5.3.1. Channel fragments 464

As Fig. 7 illustrates, in multiple channel configurations such as 465 466 sub2\_ch1 or sub1\_ch2, a DCA requirement demands a 2-channel connection and such a connection is possibly built by using chan-467 468 nels from two distinct routes<sup>1</sup> (Fig. 7b)). This kind of channel alloca-469 tion will generate channel fragments. Although there is a free channel between nodes SRC and 01, it cannot be utilized by node 470 00 to build a connection with 2-channel requirement to node 01. 471 472 In comparison, single channel configuration sub1\_ch1 has the double 473 channel width of sub2\_ch1 or sub1\_ch2. Thus, it can build a 474 one-channel connection to satisfy the same DCA requirement. 475 There is no fragment in sub1\_ch1 under such a DCA requirement. 476 Channel fragments are the result of increased flexibility when given 477 a higher number of narrower channels. But for certain traffic scenar-478 ios this increased flexibility is doing more harm than good, as we will 479 see in Fig. 15.

#### 480 5.3.2. Superfluous connections

Connections that are setup but cannot be used, are superfluous. 481 482 Suppose we have a CS NoC as sub2\_ch1, and each channel is 2-byte 483 wide. For example, as Fig. 8 suggests, a connection with 4 bytes 484 width requirement needs to send out 2 probes and set up two 485 one-channel connections simultaneously with DCA scheme. However, since some of the channels have already been occupied, 486 only one one-channel connection can be established. As a result, 487 488 since the DCA requirement cannot be satisfied, data transfer cannot 489 be launched and thus the established one-channel connection becomes superfluous. The superfluous connection will be released 490 491 and then a new setup is attempted. However, the reserve and 492 release of these superfluous connections inside a CS NoC puts a 493 heavy burden on the network and degrades performance.

#### 6. Experiments and evaluations 494

In this section, we will check whether experiment results are in 495 496 accordance with our design goals, intuitions and analysis. All experiments are based on  $8 \times 8$  mesh topology. Uniform random 497 traffic with Poisson arrival time distribution is used for evaluation 498 499 purpose.

In addition to the four configurations in Table. 3, configuration 500 501 sub2\_ch1 is also used in our experiments. This configuration has two sub-networks, each of which has a channel width of 4 bytes, so that the total channel width per-direction is also 8 bytes. The clock frequencies of sub2 ch1 are the same as sub4 ch1.

We use two scenarios which use ACA and DCA, respectively, to compare the performance of different number of sub-channels and sub-networks, as well as the path searching algorithms. In both scenarios we include several test cases, such as fixed packet size case (FPS) (all packets have the same size), variable packet size case (VPS) (all packets have random packet sizes). However, since FPS and VPS have similar results, we just show the results of FPS.

When nothing else is specified, we use the parallel probing algorithm by default.

#### 6.1. Simulation method and metrics

Inside each resource node a generator generates setup requests according to a probability and pushes them into a queue. An FSM pops a request from the queue and sends it out when sufficient output channels are available. Then the FSM waits for a success or failure notification. Then it either retries the request or commences the data transfer.

We have implemented an HDL model for synthesis and for area and power evaluation. We have also built a cycle accurate SystemC simulator which can run 10-30 times faster than the HDL model. Any data point that is shown in the figures comes from simulation of 250 million cycles, of which the first 250 thousand cycles are discarded as warm up period.

Suppose  $\beta$  is the packet generation probability and *M* is the packet size (in bytes), and *clk*<sub>freq</sub> is the data path clock frequency of a configuration, *B* is the bandwidth of a resource node, then the injection rate per node (IR) is defined as

$$IR = \beta * M * clk_{freq}$$

Besides throughput and delay, we use bandwidth utilization efficiency (Eb), also called normalized throughput, as one of the metrics. It is defined as

$$Eb = \frac{Throughput (per node)}{Bandwidth (per node)}$$
539

In our simulations each configuration operates at its maximum frequency.

## 6.2. ACA (adaptive channel allocation) evaluation

#### 6.2.1. Evaluation of path searching algorithms

Three path searching algorithms are compared, which are x-y544 [15], minimal adaptive [15] and parallel probing [2]. The results 545 in Fig. 9 suggest that parallel probing is the best path searching 546 algorithm for ACA scheme. E.g. at offered load 0.35, the average 547 packet delay of parallel probing is only 83% of minimal adaptive, 548 and 57% of x-y algorithm. We also have evaluated algorithms in 549 550 different channel number and configurations. Their results suggest 551 the same ranking of algorithms. Consequently, we choose parallel probing as our default path searching algorithm. 552

#### 6.2.2. Influences of packet size on maximum throughputs

The influence of packet size on maximum normalized through-554 put is shown in Fig. 10, which suggests that as packet size 555 increases, the maximum normalized throughput for each configu-556 ration also goes up. This result complies with intuition 1 from our 557 model. Thus, we may safely conclude that CS NoC is suitable for 558 delivering large packets. This is the reason why throughout this 559 paper we prefer large packets for evaluations. This conclusion 560 implies that applications that generate large bulk of data for com-561 munication, like task allocation and migration on MPSoC, or page 562

Please cite this article in press as: S. Liu et al., MultiCS: Circuit switched NoC with multiple sub-networks and sub-channels, J. Syst. Architect. (2015), http://dx.doi.org/10.1016/j.sysarc.2015.07.013

7

502

503

504

505

506

507 508

509

510

511

512

513

514

515

516

517

518

519 520

521

522

523

524

525

526

527

528

529

530 531

533

534

535

536 537

540

541

542

543

<sup>&</sup>lt;sup>1</sup> In this situation, each flit will be split into two phits at the source, with each route simultaneously delivering only one phit. At the receiver side, it will restore a flit by combining the two phit together.

S. Liu et al./Journal of Systems Architecture xxx (2015) xxx-xxx

8

Table 3

Per-node synthesis results of different CS NoCs with 8 bytes of wires per-direction.

| Configuration                 | Sub1_ch4 | Sub2_ch2 | Sub4_ch1 | Sub1_ch1 |
|-------------------------------|----------|----------|----------|----------|
| Channel width                 | 2 Bytes  | 2 Bytes  | 2 Bytes  | 8 Bytes  |
| Num. sub-network              | 1 (Sub1) | 2 (Sub2) | 4 (Sub4) | 1 (Sub1) |
| Num. sub-channel              | 4 (Ch4)  | 2 (Ch2)  | 1 (Ch1)  | 1 (Ch1)  |
| Max. probe freq. (MHz)        | 556      | 740      | 1111     | 1111     |
| Max. data freq. (GHz)         | 1.116    | 1.397    | 1.786    | 1.786    |
| Total area (um <sup>2</sup> ) | 150214.5 | 86777.5  | 57599.5  | 30874.9  |
| Probe path area               | 85791.8  | 53161.3  | 35632.8  | 8908.2   |
| Data path area                | 64422.7  | 33616.2  | 21966.7  | 20133.7  |
| Power@max. freqs. (mW)        | 50.1     | 45.0     | 49.5     | 26.8     |

based virtual memory management, benefit from CS NoCs, while
 applications with mostly short messages may prefer PS NoCs, as
 concluded in [16].

566 6.2.3. Evaluation of different number of channels

The experiment results of splitting a wide channel into narrow sub-channels is shown in Figs. 11 and 12. The packet size is fixed to 5120 bytes in this evaluation.

570 As shown in Fig. 12, sub4\_ch1 provides higher throughput than 571 sub2\_ch1, which in turn is better than sub1\_ch1. E.g. the maximum 572 throughput of sub4\_ch1 is about 17% higher than sub1\_ch1. The 573 increase in maximum throughput complies with our intuition II. 574 We can imagine that some packets in the sub4\_ch1 configuration 575 use only 1, 2, or 3 of the subnetworks. However, e.g. using 1 sub-576 network with 1/4 the channel width compared to the sub1\_ch1 con-577 figuration means that the packet consists of  $4 \times$  the number of flits. 578 As we studied in the last section, larger packet sizes lead to higher 579 maximum throughput in CS NoC. Thus, using narrow sub-links will 580 achieve higher maximum throughput because the average packet 581 size counted in flits is also larger.

Regarding delay, as Fig. 11 suggests, sub1\_ch1 has better packet 582 583 delay results only when the injection rate is low. This is due to the 584 connection setup delay contributes little to the total packet delay 585 because of low contention probability. In this situation, data trans-586 fer delay dominates the total packet delay. Sub1\_ch1 has shorter data transfer delay due to its wider channel. However, at high 587 588 injection rate, sub4\_ch1 outperforms sub1\_ch1. For example, at injection rate 3500 MB/s, the average packet delay of sub4\_ch1 is 589 590 20% less than sub1 ch1.

If throughput is our main concern, the number of channels
should be maximized. However, in our design, the minimum channel width is decided by probe format, which is about 14 bits.
Narrower than this value complicate the probe delivering process.

#### 595 6.2.4. Evaluation of different configurations (Fig. 13)

596

597

598

599

600

601

602

603

604

605

606

607

608

Although sub1\_ch4 has more switching flexibility than sub4\_ch1, this advantage is compensated by its slower clock frequency. As a result, sub-network configuration (sub4\_ch1) outperforms sub-channel configurations (sub2\_ch2 and sub1\_ch4) in delay and throughput. Sub1\_ch1 has lower maximum throughput than the other multi-channel configurations. However, it presents better latency at low load for the same reasons explained above.

Bandwidth utilization efficiency discounts the difference in frequency and gives a performance comparison under the assumption that the networks operate at the same frequency. Sub1\_ch4 has the best bandwidth utilization efficiency under ACA scheme. For example, we observed 30% higher efficiency than sub1\_ch1 in ACA scheme. Sub2\_ch2 and sub4\_ch1 fall in between.

Bandwidth utilization efficiency may also be useful because in
certain situations the maximum clock frequency differences of
configurations are not sharp. For example, as reported in [25],
when implemented in FPGA, a CS NoC with SDM channels roughly
has the same maximum clock frequency no mater if 1 or 4



Fig. 11. Delay influence of dividing a wide channel into narrow sub-channels (packet size 5120 bytes).



Fig. 12. Throughout influence of dividing a wide channel into narrow sub-channels (packet size 5120 bytes).

sub-channels are used. In situations like this, sub1\_ch4 could offer better performance than other configurations.

We also tested under ACA scenarios with variable size of packets. The comparison among different configurations basically shows consistent results and is thus omitted here.

#### 6.2.5. Comparison between ACA scheme and one-channel-perconnection (OCPC) scheme

OCPC is compared with ACA by using configuration sub4\_ch1, as Fig. 14 suggests, at low load ACA offers much better average packet delay. E.g. at load 0.02, average packet delay with ACA is 170 probe clock cycles, while it is 490 cycles with OCPC. At very high load, OCPC represents slightly higher bandwidth utilization efficiency, and its maximum bandwidth utilization efficiency is 0.283, while for ACA it is 0.271.

The comparison result obeys our design goals of ACA in Section 4. At low load when data transfer delay dominates, ACA can significantly shorten the delay since the majority of packets are delivered by wide connections. At high load, due to contention, the probability of building a connection containing multiple channels is low and the majority of connections contain one channel only. Thus, the average packet length in flits by using ACA at high load is just slightly smaller than using one-channel-per-connection

614 615

616 617 618

- 619
- 620 621 622

623

624

625

626 627 628

S. Liu et al./Journal of Systems Architecture xxx (2015) xxx-xxx



Fig. 13. Performance results of scenario ACA-FPS (packet size 5120 bytes).



Fig. 14. Comparison between ACA scheme with traditional OCPC scheme (packet size 1280 bytes).

scheme. As suggested by intuition II, the maximum bandwidth uti-636 637 lization of ACA drops slightly (smaller than 5% in this experiment).

#### 6.3. DCA (deterministic channel allocation) evaluation 638

639 As mentioned before, ACA puts more emphasis on performance, while DCA focuses on desired throughput and flit width. 640

#### 6.3.1. Evaluation of different number of channels 641

Previously, we showed that ACA benefits from the usage of mul-642 tiple channels. However, the following example with DCA tells a 643 644 different story.

645 Fig. 15 shows the result of a simulation with the exact connection width requirement of 8 bytes and a fixed packet size of 5120 646 bytes. In this case, as the latency curves suggest, more channels 647 lead to higher delay. 648

649 This observation complies with our analysis in Section 5.3. Sub4\_ch1 and sub2\_ch1 perform worse than sub1\_ch1, since 650 sub1\_ch1 generates neither channel fragments nor superfluous 651 connections. Sub4\_ch1 performs worse than sub2\_ch1 because it 652 653 generates more superfluous connections.

654 In addition, it is worth noting that, if less than half of the band-655 width can be utilized, splitting the wide channel into sub-links 656 seems to be beneficial, even without special care on channel frag-657 ments and superfluous channels. In Fig. 16, the exact throughput 658 requirement is 4 bytes/cycle. Because only half of the channel 659 width in sub1\_ch1 can be utilized for data transfer, sub1\_ch1 is 660 inferior to the other multi-channel configurations. This result also



Fig. 15. Influence of dividing a wide channel into narrow sub-links for DCA transfer (packet size 5120 bytes).

suggests that, according to the connection width requirement, proper channel partitioning could still be beneficial.

#### 6.3.2. Evaluation of different configurations

For delay and throughput, both Figs. 16 and 17 demonstrate that sub-network configuration sub4\_ch1 outperforms sub-channel configurations sub2\_ch2 and sub1\_ch4. These results are similar to those under ACA. However, in Fig. 17, the channel utilization efficiency of sub1\_ch4 is worse than sub2\_ch2, which is worse than sub4\_ch1. This observation opposes the result with ACA and is not quite expected. It seems that switching flexibility becomes a handicap in this DCA case.

The reason for this phenomenon is that sub1\_ch4 and sub2\_ch2 are more likely to generate superfluous connections. Due to the increased switching flexibility, sub1\_ch4 and sub2\_ch2 have higher chances to set up a one-channel connection, which leads to a higher burden on the network due to set up and release of superfluous connections.

Generally speaking, as the comparison of Figs. 16 and 17 with Fig. 13 suggests, we may conclude that ACA offers better performance than DCA scheme. However, as mentioned before, DCA offers exactly predefined connection width and throughput.

We also tested DCA scenarios with variable size of packets. The comparison among different configurations basically shows consistent results and is thus omitted here.

## 7. Related work

The usage of sub-network and sub-channel in PS NoC has been 686 studied during the past. For example, the cost and effect of 687

685

661

662

663

664

665

666

667

668

669

670

671

672

673

674

675

676

677

678

679

680

681

682

683

Please cite this article in press as: S. Liu et al., MultiCS: Circuit switched NoC with multiple sub-networks and sub-channels, J. Syst. Architect. (2015), http://dx.doi.org/10.1016/j.sysarc.2015.07.013

698

699

700

701

702

703

704

705

706

707

708

709





Fig. 16. Performance results of scenario DCA-FPS (packet size 2560 bytes, connection width requirement is 4 bytes).



Fig. 17. Performance results of scenario DCA-FPS (packet size 5120 bytes, connection width requirement is 8 bytes).

introducing sub-networks into PS NoC has been studied by Yoon et 688 689 al. in [17,18]. The pros and cons of using sub-channels in PS NoC 690 has also been investigated in [3]. Besides, work [19,20] intend to 691 increase the switching flexibility between virtual channels and 692 the output ports of PS NoCs. In [19], several separate cross-bars 693 are used inside one router, so that each virtual channel (VC) can 694 choose between multiple crossbars to reach an output. In [20], a new switching layer is introduced at each input port, so that mul-695 tiple VCs of an input port can be connected to different outputs at 696 697 the same time.

However, compared with PS NoC, the usage of sub-network and sub-channel in CS NoC is not fully exploited and evaluated. Actually, in the past, the CS NoC architecture assumed in many papers (e.g. [2,11,21]) has just one duplex-channel between every two neighboring nodes. They did not consider the situation when a CS NoC can have multiple physical channels between two nodes.

Although some works [4,22] design CS NoCs with multiple channels and organize them in a sub-channel (SDM) way, the consequences of applying multiple channels in CS NoCs are still not well studied. For example, although [4,22] have multiple channels, packets are still delivered by following connections with only 1-channel width.

Another import aspect about CS NoC is connection setup, since a 710 711 CS NoC requires a connection should be established before data 712 transfer begins. According to the connection search and setup 713 method. CS NoCs can be classified into two categories: dynamic 714 setup or static setup. Static setup methods schedule connections at 715 compilation time. As a result, they [23,24] may not well support 716 applications like H.264 [25] with requirements for dynamic commu-717 nication setups. Therefore, in this paper, we only focus on dynamic 718 methods which search and setup connections at run time.

Dynamic methods can be further classified into centralized ordistributed methods. Generally speaking, centralized set-up like

[21,26] has two disadvantages. Firstly, the central schedule node 721 needs to receive setup/release requests and distribute allocation 722 decisions from/to the entire network. Such multiple-to-one and 723 one-to-multiple traffic pattern is likely to become the system bottle-724 neck which the number of nodes inside a NoC grows [27]. Secondly. 725 since retrying of failed requests causes the blockage of the following 726 requests, failed setup requests are usually dropped in centralized 727 setup methods. Thus, we focus on decentralized setup. 728

Distributed setup can be implemented by sending configuration packets [4,28,29] or by a probing search approach [1,11,6,5].

729

730

Sending configuration packets requires a separate PS (packet 731 switched) NoC to deliver configuration messages like set-up, 732 tear-down and Ack/Nack during a connection setup procedure. In 733 our view, this approach suffers from four major drawbacks. 734 Firstly, using an additional PS NoC for connection set-up is an 735 unnecessary overhead. Secondly, set-up, tear down and Ack/Nack 736 packets of a connection must be routed by pre-determined routing 737 algorithm to ensure them on the same connection. For example, 738 [4,29] use deterministic routing algorithm, and in [28], source 739 based routing information has to be carried by each configuration 740 packet. However, such pre-determined routing algorithm is a 741 sub-optimal choice among routing algorithms. Thirdly, compared 742 with probing search, tear-down and Ack/Nack signals have to be 743 sent in the form of packets. These packets will contend with 744 set-up packets inside the PS NoC. There is typically no delay guar-745 antee for configuration packets in the PS network, rendering the 746 connection set-up procedure unpredictable. Fourthly, this 747 approach does not scale well. The auxiliary PS NoC has fixed 748 throughput, since each output port of a switch just allows to deli-749 ver one setup packet at a time. However, if there are many 750 sub-channels in a CS NoC, and since each sub-channel requires a 751 separate setup packet for connection configuration, this will signif-752 icantly increase the number of setup packets as observed in [4]. 753

S. Liu et al./Journal of Systems Architecture xxx (2015) xxx-xxx

816 817

826

827

828

858 859

860

861

891

892

893

894

895

896

754 Compared with above mentioned shortcomings of a packet con-755 figuration approach, probing search is the superior choice because 756 of its efficiency in wire usage and connection setup procedure. The 757 concept of the probing search was first proposed in [5]. Pham et al. 758 [11,6] developed a backtracking path searching algorithm, which reportedly has better performance than [5]. Another contribution 759 760 of [11,6] is that a source synchronized data transfer mechanism is introduced into CS NoCs, so that separate clocks can be applied 761 to connection set-up and data transfer. [2] developed a parallel 762 probing method for CS NoC. It can complete a search over all pos-763 sible paths within O(n) time complexity where *n* is the geometric 764 distance between source and destination. They demonstrated 765 superior performance of this parallel probing algorithm compared 766 to Pham's backtracking algorithm [11] by experiments. But their 767 768 channel allocation mechanism [2] is too complicated for 769 multi-sub-channel usage.

770 The probing search approaches in all aforementioned works [1,11,6,5] are only implemented on CS NoC with a single channel 771 772 between two neighboring nodes.

In this paper, we extend the parallel probing search method [2] 773 774 to multiple sub-channels and sub-networks and study cost and 775 performance of several configurations with sub-channels and 776 sub-networks among 1 and 4.

#### 777 8. Conclusion and future work

778 We have implemented MultiCS, a CS with multiple sub-channels and sub-networks with a parallel probing setup algo-779 780 rithm to study the consequences of splitting a wide channel into 781 narrow channels. The design space of multi-channel CS NoC is 782 explored from two angles: the channel number, and channel con-783 figurations. We have reached the following main conclusions:

784 A. Given a number of wire resources for each node inside a CS NoC, with ACA scheme, the thinner the channel width with 785 more channels, the higher the throughput. However, the 786 latency for data transfer also increases by using thinner 787 channels. Sub-channels (SDM channels) consume much 788 789 more resources than sub-networks. When splitting a wide channel into *n* narrow channels, organizing those channels 790 791 into sub-networks gives an O(n) increase in area, and the 792 critical latency is unchanged. However, organizing those 793 channels in sub-channels increases the area by  $O(n^2)$ , and the delay by O(n). Furthermore, our experiments suggest 794 795 that sub-networks offer better performance than 796 sub-channels. Although sub-channels can achieve better 797 channel efficiency due to higher switching flexibility, this 798 is only useful in special situations. Thus, in general 799 sub-networks are more efficient than sub-channels.

800 B. We can build a connection consisting of multiple channels with different schemes. The DCA offers desired and prede-801 fined throughput and flit width, but channel fragments and 802 superfluous connections are two obstacles for DCA. 803 Because of this, under certain width requirements, the per-804 formance of using multiple channels is even worse than 805 using one single wide channel. ACA generally offers better 806 performance than DCA. However, although ACA provides 807 minimum connection width guarantee (one channel width). 808 809 the actual width of a connection by applying ACA cannot be 810 known beforehand. The connection width is decided by the 811 success probability of setup probes, which depends on the dynamic channel use. 812 813

814 Our future work will study techniques to avoid channel frag-815 ments and superfluous connections, in depth evaluation of multi-channel CS NoC, and implementation and evaluation of mixed packet and circuit switched NoCs.

#### References

- [1] A. Leroy, P. Marchal, A. Shickova, F. Catthoor, F. Robert, D. Verkest, Spatial division multiplexing: a novel approach for guaranteed throughput on NoCs, in: Proceedings of the IEEE/ACM/IFIP International Conference on Hardware/ Software Codesign and System Synthesis, 2005, pp. 81-86.
- [2] S. Liu, A. Jantsch, Z. Lu, Parallel probing: dynamic and constant time setup procedure in circuit switching NoC, in: proceedings of Design, Automation Test in Europe Conference Exhibition (DATE'12), 2012, pp. 1289–1294.
- [3] C. Gomez, M. E. Gomez, P. Lopez, J. Duato, Exploiting wiring resources on interconnection network: increasing path diversity, in: Proceedings of Euromicro Conference on Parallel, Distributed and Network-Based Processing (PDP'08), 2008, pp. 20-29.
- [4] A.K. Lusala, J.-D. Legat, Combining SDM-based circuit switching with packet switching in a router for on-chip networks. Int. J. Reconfigurable Comput. 2012 (2012) 1 - 16.
- [5] D. Wiklund, D. Liu, SoCBUS: switched network on chip for hard real time embedded systems, in: Proceedings of Parallel and Distributed Processing Symposium, 2003, p. 8.
- P.-H. Pham, P. Mau, J. Kim, C. Kim, An on-chip network fabric supporting coarse-grained processor array, IEEE Trans, Very Large Scale Integr, VLSI Syst. 21 (99) (2013) 178-182.
- D.U. Becker, W.J. Dally, Allocator implementations for network-on-chip routers, in: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, 2009, pp. 52:1-52:12.
- [8] S. Liu, A. Jantsch, Z. Lu, A fair and maximal allocator for single-cycle on-chip homogeneous resource allocation, IEEE Trans. Very Large Scale Integr. VLSI Syst 22 (10) (2014) 2229-2233
- D. Walter, S. Hoppner, H. Eisenreich, G. Ellguth, S. Henker, S. Hanzsche, R. Schuffny, M. Winter, G. Fettweis, A source-synchronous 90 Gb/s capacitively driven serial on-chip link over 6 mm in 65 nm CMOS, in: proceedings of Solid-State Circuits Conference Digest of Technical Papers (ISSCC'12), 2012, pp. 180-182
- [10] D. Schinkel, E. Mensink, E. Klumperink, E. van Tuijl, B. Nauta, Low-power, highspeed transceivers for network-on-chip communication, IEEE Trans. Very Large Scale Integr. VLSI Syst. 17 (1) (2009) 12-21.
- [11] P.-H. Pham, J. Park, P. Mau, C. Kim, Design and implementation of backtracking wave-pipeline switch to support guaranteed throughput in network-on-chip, IEEE Trans. Very Large Scale Integr. VLSI Syst. 20 (2) (2012) 270-283.
- A.K. Lusala, J.-D. Legat, A SDM-TDM-based circuit-switched router for on-chip networks, ACM Trans. Reconfigurable Technol. Syst. 5 (3) (2012) 15:1-15:22.
- [13] J. Rose, S. Brown, Flexibility of interconnection structures for fieldprogrammable gate arrays, IEEE J. Solid-State Circuits 26 (3) (1991) 277-282.
- [14] J.Y. Le Boudec, Performance Evaluation of Computer and Communication Systems, Epfl Press, 2011.
- [15] W.J. Dally, B. Towles, Principles and Practices of Interconnection Networks, Morgan Kaufmann, 2003.
- [16] S. Liu, A. Jantsch, Z. Lu, Analysis and evaluation of circuit switched NoC and packet switched NoC, in: Proceedings of Euromicro Conference on Digital System Design (DSD'13), 2013, pp. 21-28.
- [17] Y. J. Yoon, N. Concer, M. Petracca, L. Carloni, Virtual channels vs. multiple physical networks: a comparative analysis, in: Proceedings of IEEE Design Automation Conference (DAC'10), 2010, pp. 162–165.
- [18] Y.J. Yoon, N. Concer, M. Petracca, L.P. Carloni, Virtual channels and multiple physical networks: two alternatives to improve NoC performance, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 32 (12) (2013) 1906-1919.
- [19] S. Noh, V.-D. Ngo, H. Jao, H.-W. Choi, Multiplane virtual channel router for network-on-chip design, in: Proceedings of First International Conference on Communications and Electronics (ICCE'06), 2006, pp. 348-351
- [20] F. Gilabert, M.E. Gómez, S. Medardoni, D. Bertozzi, Improved utilization of NoC channel bandwidth by switch replication for cost-effective multi-processor systems-on-chip, in: Proceedings of the ACM/IEEE International Symposium on Networks-on-Chip (NOCS'10), 2010, pp. 165-172.
- [21] M. Winter, G.P. Fettweis, Guaranteed service virtual channel allocation in NoCs for run-time task scheduling, in: Proceedings of Design, Automation Test in Europe Conference Exhibition (DATE'11), 2011, pp. 1–6.
- [22] A. Leroy, D. Milojevic, D. Verkest, F. Robert, F. Catthoor, Concepts and implementation of spatial division multiplexing for guaranteed throughput in networks-on-chip, IEEE Trans. Comput. 57 (9) (2008) 1182–1195.
- [23] R. Stefan, A. Molnos, K. Goossens, dAElite: a TDM NoC supporting QoS, multicast, and fast connection set-up, IEEE Trans. Comput. PP (99) (2012) 1.
- [24] K. Goossens, J. Dielissen, A. Radulescu, AEthereal network on chip: concepts, architectures, and implementations, IEEE Des. Test Comput. 22 (5) (2005) 414-421.
- [25] N. Ma, Z. Lu, L. Zheng, System design of full HD MVC decoding on mesh-based multicore NoCs, Microprocess. Microsyst. 35 (2) (2011) 217-229.
- [26] M. Winter, G.P. Fettweis, A network-on-chip channel allocator for run-time task scheduling in multi-processor system-on-chips, in: Proceedings of EUROMICRO Conference on Digital System Design Architectures, Methods and Tools (DSD'08), 2008, pp. 133-140.

## SYSARC 1297 13 August 2015

# **ARTICLE IN PRESS**

12

897

898

899

900

901

902

903

904

905

906

907

908 911

912

913

914

915

916

917

918

919

#### S. Liu et al./Journal of Systems Architecture xxx (2015) xxx-xxx

- [27] S. Liu, A. Jantsch, Z. Lu, Parallel probe based dynamic connection setup in TDM NoCs, in: Proceedings of the Conference on Design, Automation & Test in Europe (DATE'14), 2014, pp. 239:1–239:6.
- [28] J. Lim, E. Hunt Siow, Y. Ha, P.K. Meher, Providing both guaranteed and best effort services using spatial division multiplexing NoC with dynamic channel allocation and runtime reconfiguration, in: Proceedings of International Conference on Microelectronics (ICM'2008), 2008, pp. 329–332.
- [29] A.K. Lusala, J.-D. Legat, Combining sdm-based circuit switching with packet switching in a NoC for real-time applications, in: Proceedings of IEEE International Symposium on Circuits and Systems (ISCAS'11), 2011, pp. 2505–2508.



Zhonghai Lu received the B.Sc. degree from Beijing Normal University, Beijing, China, in 1989, and the M.Sc. and Ph.D. degrees from KTH Royal Institute of Technology, Stockholm, Sweden, in 2002 and 2007, respectively. He is currently an Associate Professor with KTH. His research interests include Network-on-Chip, Embedded Systems, Computer Architecture, and Internet-of-Things. He has published over 130 peer-reviewed papers in transactions, journals and international conferences in these areas.

947

936

937

938

939

940

941

942

943

944

945

946



Shaoteng Liu received the B.Sc. degree from Fudan University, Shanghai, China, in 2006. He received his M.Sc. degree from Royal Institute of Technology (KTH), Stockholm in 2010. He is currently a PHD student at KTH. His current research interests include system modeling, performance analysis, embedded operating system, reconfigurable computing, network-on-chip and software defined network.



Axel Jantsch received the Dipl.Ing. and Dr.Tech. degrees from the Technical University of Vienna, Vienna, Austria, in 1988 and 1992, respectively. He was a professor of electronic system design with the Royal Institute of Technology, Stockholm, Sweden, from December 2002 to September 2014. He is currently a professor in system on chip with TU Wien, Vienna, Austria. His current research interests include VLSI design and synthesis, system-level specification, modeling and validation, HW/SW co-design and co-syntheses, reconfigurable computing, and networks-on-chip.