

DESY technical seminar Zeuthen – 8 Nov 2011

Filippo Mantovani filippo.mantovani@desy.de



(F. Mantovani, D. Pleiter, F. S. Schifano, H. Simma)



# Why a custom network?

Scalability  $\Leftrightarrow$  torus:

- → cost and complexity (linear in # of procs)
- → performance (in particular nearest neighbours)
- → latency (in particular nearest neighbours)

Integration:

- → size
- → power

Tightly coupled computing nodes:

- → close interface to CPU (N-P)
- → light-weight protocol (N-N)

Filippo Mantovani - 08.11.2011

### **Previous examples**

|                  | unit        | APEnext    | BG/P       | Cell/QPACE   |
|------------------|-------------|------------|------------|--------------|
| Year             |             | 2006       | 2008       | 2009         |
| f<sub>clk</sub>  | [GHz]       | 0.13       | 0.85       | 3.2          |
| # of cores       |             | 1          | 4          | 8            |
| DP peak          | [Gflops]    | 1          | 13.6       | 100          |
| Power            | [W/Gflop]   | 9          | 3          | 1.5          |
| Memory bw        | [GByte/s]   | 2          | 13.6       | 25           |
| Memory bw        | [word/flop] | 1/4        | 1/8        | 1/32         |
| Network bw       | [GByte/s]   | 0.67       | 2.55       | 5.5          |
| Network bw       | [word/flop] | 1/12       | 1/42       | 1/145        |
| Network latency  | [ns]        | $\sim 300$ | $\sim 800$ | $\sim 3000$  |
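
The word/flop rows follow from the bandwidth and peak-performance rows. The sketch below redoes that arithmetic, assuming one word is an 8-byte double-precision number (an assumption that is consistent with the table but not stated on the slide). The function name `flops_per_word` is chosen here for illustration only.

```python
WORD_BYTES = 8  # assumption: one word = one 64-bit double-precision number


def flops_per_word(bandwidth_gbyte_s, peak_gflops, word_bytes=WORD_BYTES):
    """Return x such that the machine moves one word per x flops (1/x word/flop)."""
    words_per_s = bandwidth_gbyte_s / word_bytes  # Gword/s
    return peak_gflops / words_per_s


machines = {
    # name: (DP peak [Gflops], memory bw [GByte/s], network bw [GByte/s])
    "APEnext": (1.0, 2.0, 0.67),
    "BG/P": (13.6, 13.6, 2.55),
    "Cell/QPACE": (100.0, 25.0, 5.5),
}

for name, (peak, mem_bw, net_bw) in machines.items():
    print(f"{name:10s} memory 1/{flops_per_word(mem_bw, peak):.1f} word/flop, "
          f"network 1/{flops_per_word(net_bw, peak):.1f} word/flop")
```

The computed ratios agree with the table up to rounding; the trend is the point: each generation moves fewer words per flop, both from memory and over the network.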









- → **P:** Processor / CPU (general purpose)
- → **N:** Network processor / NWP (implemented on FPGA)

### **Operations:**

- → **put:** Initiator = source
- → **get:** Initiator = destination
















# Goals

- → Implement PCIe-based interface of the network processor
- → Micro-benchmark of a communication model between CPU and FPGA
- → Replace external PHYs by high speed transceivers within FPGA
- → Test stability of the physical link with internal transceivers

### Hardware setup:

Aurora (in comparison with QPACE):

- → Cell → Intel Nehalem X5570 @ 2.93 GHz / E5540 @ 2.53 GHz
- → FlexIO → QPI + IOH + PCIe














### Sub-parts of the design:

1. PCI-express architecture in a nutshell;
2. The FPGA design implementing an Nget engine to fetch data for transmission;

[Figure: block diagram of the network processor FPGA, showing the PCIe core, the TX and RX FIFOs, the links and the transceiver.]










### PCIe architecture -1-

The Transaction Layer is responsible for:

- → Storing negotiated and programmed configuration information
- → Managing link flow control
- → Enforcing ordering and Quality of Service (QoS)
- → Power management control/status

Header information may include:

- → Address/Routing
- → Data transfer length
- → Transaction descriptor

End to End CRC checking provides additional security (optional)

### PCIe architecture -2-

The Transaction Layer Packets (TLPs) relevant to this talk are:

- → Memory Read (MemRd): 16 Byte header, no payload;
- → Memory Write (MemWr): 16 Byte header, max payload 1024 Byte;
- → Completion with Data (CplD): 16 Byte header, max payload 1024 Byte.
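
To get a feeling for what the header costs, the following sketch computes the fraction of transferred bytes that is payload when a message is split into packets of at most 1024 bytes, each carrying a 16-byte header. It counts the header only; framing, sequence numbers and CRC added by the lower PCIe layers are deliberately ignored, so the numbers are an upper bound. The function names are illustrative, not part of any library.

```python
HEADER_BYTES = 16         # header size quoted on the slide
MAX_PAYLOAD_BYTES = 1024  # maximum payload quoted on the slide


def tlp_count(length, max_payload=MAX_PAYLOAD_BYTES):
    """Number of packets needed to carry `length` payload bytes."""
    return -(-length // max_payload)  # ceiling division


def tlp_efficiency(length, max_payload=MAX_PAYLOAD_BYTES, header=HEADER_BYTES):
    """Payload fraction of the transferred bytes, counting headers only."""
    if length <= 0:
        raise ValueError("length must be positive")
    return length / (length + tlp_count(length, max_payload) * header)


for size in (64, 256, 1024, 4096):
    print(f"{size:5d} bytes: {tlp_count(size)} packet(s), "
          f"efficiency {tlp_efficiency(size):.3f}")
```

Small messages pay proportionally more for the header, which is one reason bandwidth figures are usually quoted for large payloads.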

Interface Altera PCIe Core  $\Leftrightarrow$  **Avalon Bus** 







### The Nget engine

A communication scheme in which:

1. The processor triggers a read operation by sending a read request to the network processor;
2. The FPGA processes the request and sends a read command to the processor;
3. The processor answers with the data to be sent through the network;
4. When all data have arrived, the FPGA writes a notification message to a memory location so that the processor can detect the end of the operation.
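
The four steps of the Nget scheme can be mimicked by a toy software model. The sketch below is not the FPGA design: it assumes that the FPGA splits a request into read commands of at most `max_payload` bytes and handles them one after the other, whereas real hardware may keep several reads outstanding. The function name `nget` and the event tuples are illustrative choices.

```python
def nget(host_memory, offset, length, max_payload=1024):
    """Toy model of one Nget transaction.

    Returns (data, events, notify): the bytes fetched for transmission,
    the list of messages exchanged, and the value of the notify location.
    """
    events = [("CPU->FPGA", "nget request", offset, length)]       # step 1
    tx_fifo = bytearray()
    pos, end = offset, offset + length
    while pos < end:
        chunk = min(max_payload, end - pos)
        events.append(("FPGA->CPU", "read command", pos, chunk))    # step 2
        tx_fifo += host_memory[pos:pos + chunk]                     # step 3
        events.append(("CPU->FPGA", "data", pos, chunk))
        pos += chunk
    notify = 1                                                      # step 4
    events.append(("FPGA->CPU", "notification", offset, length))
    return bytes(tx_fifo), events, notify


memory = bytes(range(256)) * 16          # 4 KiB of fake host memory
data, events, notify = nget(memory, offset=128, length=2500)
for event in events:
    print(event)
```

The model makes the cost of the scheme visible: every message needs a request and a notification in addition to the data transfer itself.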




**Driver:**

- → read/write via IOCTL and/or memory map;
- → polling on the notify locations.

**Low level library:**

- → init/release/reset;
- → read, write (IOCTL/mm);
- → nget, nget\_wait;

**Communication library:**

- → TX<sub>i</sub> (trigger a send P → N) → (nget)
- → RX<sub>i</sub> (issue credit to receive data N → P) → (write)
- → TX<sub>f</sub> (test notification about completed TX) → (nget\_wait)
- → RX<sub>f</sub> (test notification about completed RX) → (poll)
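
The mapping of the four communication-library calls onto low level operations can be sketched as a thin wrapper. Everything below is hypothetical: `MockLowLevel` only simulates a device that completes every operation instantly, and the method names and signatures are invented for illustration; only the mapping TX<sub>i</sub> → nget, RX<sub>i</sub> → write, TX<sub>f</sub> → nget\_wait, RX<sub>f</sub> → poll is taken from the slide.

```python
class MockLowLevel:
    """Stand-in for the low level library; all behaviour is simulated."""

    def __init__(self):
        self.tx_done = set()   # tags whose TX notification has arrived
        self.credits = []      # receive credits written to the device
        self.rx_notify = {}    # RX notify locations, keyed by tag

    def nget(self, tag, buf):
        self.tx_done.add(tag)          # the mock completes instantly

    def nget_wait(self, tag):
        return tag in self.tx_done

    def write(self, tag):
        self.credits.append(tag)
        self.rx_notify[tag] = True     # the mock delivers data instantly

    def poll(self, tag):
        return self.rx_notify.get(tag, False)


class Comm:
    """Communication library expressed in terms of the low level calls."""

    def __init__(self, low):
        self.low = low

    def tx_init(self, tag, buf):       # TX_i -> nget
        self.low.nget(tag, buf)

    def rx_init(self, tag):            # RX_i -> write (issue a credit)
        self.low.write(tag)

    def tx_test(self, tag):            # TX_f -> nget_wait
        return self.low.nget_wait(tag)

    def rx_test(self, tag):            # RX_f -> poll
        return self.low.poll(tag)


comm = Comm(MockLowLevel())
comm.tx_init(3, b"payload")
comm.rx_init(5)
print(comm.tx_test(3), comm.rx_test(5))
```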

### **Driver modes**

- → Data buffer allocation:
  - 1. User-space;
  - 2. Kernel-space (copy required);
  - 3. Kernel-space (memory map);
- → Write Nget requests via:
  - 1. IOCTL operations;
  - 2. Write operations on locations that are memory mapped;
- → To signal the end of an Nget operation, the network processor writes to a location in the CPU's main memory.

To detect the notification, the application that triggered the Nget operation can:

- 1. allocate a memory location in user space and poll it;
- 2. leave it to the driver to allocate the notification memory in kernel space and to check for the memory update using a standard polling method;
- 3. use the Intel instructions **monitor/mwait** (in kernel-space).
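
The first option, polling a notify location, can be illustrated in user space with a thread standing in for the network processor. This is only a sketch of the polling pattern: the `bytearray` plays the role of the notify location, the timer thread plays the role of the FPGA write, and `poll_notify` is a name invented here.

```python
import threading
import time


def poll_notify(location, index=0, timeout_s=1.0, sleep_s=0.0):
    """Poll location[index] until it becomes non-zero.

    Returns the number of polls needed, or -1 if the timeout expires.
    The location is cleared again so that it can be reused.
    """
    deadline = time.monotonic() + timeout_s
    polls = 0
    while time.monotonic() < deadline:
        polls += 1
        if location[index] != 0:
            location[index] = 0      # re-arm the notify location
            return polls
        if sleep_s:
            time.sleep(sleep_s)      # optional back-off between polls
    return -1


notify = bytearray(1)                # stands in for the notify location
writer = threading.Timer(0.01, lambda: notify.__setitem__(0, 1))
writer.start()                       # "FPGA" writes after about 10 ms
polls = poll_notify(notify)
writer.join()
print("notification seen after", polls, "polls")
```

Busy polling keeps a core occupied while waiting, which is the motivation for looking at alternatives such as monitor/mwait.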


### Algorithm of the Nget micro-benchmark

To benchmark the Nget design, transactions are started in a loop such that:

- → Inside the main loop up to N transactions are in flight, with $N \le 64$ (the maximum number of PCIe tags supported by the macro).
- → During each loop iteration M new transactions are started and then the algorithm waits for M outstanding transactions to complete.
- → Two concurrently active transactions differ in (lnk,vc,tag).

All Nget transactions have the same size (*L* bytes).

[Figure: signal trace of iCp, cmd, oMr, nfy and rdy for L=256, N=8, M=4.]

### Pseudo-code of the Nget micro-benchmark

```
get time stamp #S

/* Start-up: fill the pipeline with N-M transactions */
for (i=0; i<N-M; i++) {
    update (lnk,vc,ttag)
    start nget(lnk, vc, ttag, dmabufhp[j], L);
}

/* Main loop: start M transactions, then wait for M to complete */
for (k=0; k<K; k++) {
    for (i=0; i<M; i++) {
        update (lnk,vc,ttag)
        start nget(lnk, vc, ttag, dmabufhp[j], L);
    }
    for (i=0; i<M; i++) {
        update (lnk,vc,ttag)
        start nget_notify_wait(lnk, vc, ttag);
    }
    get time stamp #k
}

/* Drain: wait for the remaining N-M transactions */
for (i=0; i<N-M; i++) {
    update (lnk,vc,ttag)
    start nget_notify_wait(lnk, vc, ttag);
}
get time stamp #E
print time stamps
```
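
The loop structure of the pseudo-code can be replayed in a small Python model to check its bookkeeping: the number of transactions started and the maximum number in flight. The model makes two simplifying assumptions that are not in the source: the triple (lnk,vc,tag) is collapsed into a single round-robin tag, and transactions are waited for in the order in which they were started. `run_benchmark` and its callbacks are hypothetical names.

```python
from collections import deque


def run_benchmark(N, M, K, start, wait):
    """Replay the benchmark loops; `start(tag)` and `wait(tag)` are callbacks.

    Returns (transactions started, maximum number in flight).
    """
    if not 0 < M <= N <= 64:
        raise ValueError("need 0 < M <= N <= 64")
    in_flight = deque()
    next_tag = 0
    max_in_flight = 0

    def launch():
        nonlocal next_tag, max_in_flight
        tag = next_tag % 64               # assumption: round-robin tags
        next_tag += 1
        start(tag)
        in_flight.append(tag)
        max_in_flight = max(max_in_flight, len(in_flight))

    for _ in range(N - M):                # start-up transactions
        launch()
    for _ in range(K):                    # main loop
        for _ in range(M):
            launch()
        for _ in range(M):
            wait(in_flight.popleft())     # oldest outstanding first
    while in_flight:                      # drain
        wait(in_flight.popleft())
    return next_tag, max_in_flight


started, waited = [], []
total, peak = run_benchmark(8, 4, 10, started.append, waited.append)
print(total, "transactions started, at most", peak, "in flight")
```

With N=8, M=4 and K=10 the model starts (N-M) + K·M transactions and never exceeds N in flight, as the algorithm description requires.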










### **Transceivers (PIPE interface)**

- → 10 bit PMA-PCS interface width (between the PMA and PCS layers, i.e. after 8b/10b encoding).
- → Serial link data rates: 2.5 - 5 Gbps.
- → Supported channel bonding ×1, ×4, ×8 (×4).
- → Automatic word aligner.
- → Manual word deskew (implemented in VHDL).
- → Correct byte misalignments due to byte SerDes.
- → 8 or 16 bit per lane.
- → Frequency 250 MHz.
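
The figures in this list fit together arithmetically. With 8b/10b encoding, 10 bits on the serial line carry 8 data bits, so a 2.5 Gbps lane delivers 2 Gbit/s of data, which matches an 8 bit interface clocked at 250 MHz; a 5 Gbps lane delivers 4 Gbit/s, which matches 16 bit at 250 MHz. The pairing of the 8 and 16 bit widths with the two line rates is an inference from these numbers, not a statement from the slide, and the function names are illustrative.

```python
def payload_rate_gbit(line_rate_gbps, lanes=1):
    """Usable data rate after 8b/10b encoding, in Gbit/s."""
    return line_rate_gbps * 8 / 10 * lanes


def pipe_rate_gbit(width_bits, freq_mhz):
    """Data rate of one lane's parallel interface, in Gbit/s."""
    return width_bits * freq_mhz / 1000


for line_rate, width in ((2.5, 8), (5.0, 16)):
    serial = payload_rate_gbit(line_rate)
    parallel = pipe_rate_gbit(width, 250)
    print(f"{line_rate} Gbps lane: {serial} Gbit/s after 8b/10b, "
          f"{width} bit @ 250 MHz: {parallel} Gbit/s")

for lanes in (1, 4, 8):
    gbyte = payload_rate_gbit(2.5, lanes) / 8
    print(f"x{lanes} at 2.5 Gbps: {gbyte} GByte/s per direction")
```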



### Transceivers (clock tree)

- → coreclkout: transceiver output, AL input, clock for TX inside the transmitter (250 MHz);
- → PCIe\_core\_clk\_out: clock from PCIe core (250 MHz);
- → cal\_blk\_clk: calibration clock (10-125 MHz);
- → fixed\_clk: RX PIPE interface (125 MHz);
- → reconfig\_clk: for transceiver dynamic reconfiguration (37.5-50 MHz).





### **Concluding remarks**

- → PCIe protocol overhead is relatively low (5% of the theoretical bandwidth).
- → Data fragmentation due to the IOH configuration can degrade the bandwidth.
- → Micro-benchmarks of latency and bandwidth have been performed with the Nget engine, exercised in several environments.
- → The Nget engine communication scheme has pros and cons. Pros: it keeps the CPU free during the read. Cons: it introduces latency, because each message transmission needs a request and a notification.
- → High latency to detect notification.
- → Peak bandwidth ~ 82% of the theoretical bandwidth (for payloads ≥ 1 KB).
- → Embedded high-speed transceivers make it possible to double the bandwidth (GEN2), to reduce latency and to simplify the hardware implementation of the physical link.


## Outlook

- → Tune parameters (preemphasis and VOD) of the transceiver;
- → Extract EyeQ diagram from transceiver configurator block;
- → Interconnect gpu1  $\Leftrightarrow$  gpu2;
- → Insert the ftnw link modules.

**N.B.:** The number of transceivers in the FPGA generation used in our tests is not sufficient to implement a whole network (6 high-speed links for the network + 1 PCIe bus).