# Microsoft Research Asia Student TechFest 2016

Empower every person and every organization on the planet to achieve more.





# ClickNP: Highly Flexible and High Performance Network Processing with Reconfigurable Hardware

Bojie Li, Kun Tan, Layong (Larry) Luo, Yanqing Peng, Renqian Luo, Ningyi Xu, Yongqiang Xiong, Peng Cheng

Wireless and Networking Group



## **Specialization vs. Generalization**

- **Specialized hardware**: efficient but not flexible
- **General-purpose processor**: slow but flexible
- **Reconfigurable hardware**: the sweet spot between the two extremes



#### Catapult FPGA from MSR NeXT (ISCA '14)

- FPGAs are being deployed at scale in **Microsoft data centers**
- They accelerate cloud services and save CPU resources: Bing search, Azure Networking, Azure Storage...

# **ClickNP element**



### **Challenge: FPGA programming**

- **Low-level** hardware description languages: Verilog, VHDL...
- Hard to program, hard to debug

### **Fortunately, there are High-Level Synthesis tools**

- Develop programs in a high-level (C-like) programming language and synthesize them into a hardware description language
- However, these tools may generate **surprisingly poor hardware** from C code

# **FPGA** architecture is different from CPU

- **Slow clock frequency**: 200 MHz vs. 2~3 GHz (CPU)
- **Slow memory**: 2~4 GB/s vs. 40~100 GB/s (CPU/GPU)
- But: Massive parallelism
  - 172,600 logic elements
  - 2,014 memory blocks (each 20Kbit)
  - 1,590 DSP blocks
  - Each of them can work in parallel
- Efficient FPGA code must utilize this massive parallelism
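
To make the need for parallelism concrete, the figures above can be combined in a short back-of-the-envelope calculation. This is only a sketch: it pairs the 200 MHz clock with the 40 Gbps line-rate target stated in the design goals, and it assumes the whole datapath runs at that single clock. The variable names are illustrative.

```python
# Figures quoted in this document.
FPGA_CLOCK_HZ = 200_000_000      # 200 MHz FPGA clock
LINE_RATE_BPS = 40_000_000_000   # 40 Gbps line-rate target
MEMORY_BLOCKS = 2_014            # on-chip memory blocks
KBIT_PER_BLOCK = 20              # capacity of each block, in Kbit

# Bits that must be consumed on every clock cycle to sustain line rate
# (assumption: one clock domain for the whole datapath).
bits_per_cycle = LINE_RATE_BPS // FPGA_CLOCK_HZ

# Aggregate on-chip block memory, in Kbit.
total_block_memory_kbit = MEMORY_BLOCKS * KBIT_PER_BLOCK

# Clock-frequency gap relative to a 2~3 GHz CPU (low end, high end).
clock_gap = (2_000_000_000 // FPGA_CLOCK_HZ, 3_000_000_000 // FPGA_CLOCK_HZ)

print(bits_per_cycle, total_block_memory_kbit, clock_gap)
```

Under these assumptions the logic has to absorb 200 bits per cycle while running 10 to 15 times slower than a CPU core, which is only feasible if many logic elements and memory blocks work at the same time.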

# Streaming programming model

Parallel execution
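
As a rough software analogy for streaming with parallel execution, the sketch below chains per-packet handlers through FIFO queues and lets every stage fire once per simulated cycle. It is not ClickNP syntax: `run_pipeline`, the handler signature, and the cycle model are all hypothetical names and simplifications chosen for illustration.

```python
from collections import deque


def run_pipeline(handlers, packets):
    """Simulate a chain of stages connected by FIFO queues.

    Each handler takes one packet and returns a packet to forward,
    or None to drop it. Returns (output packets, cycles used).
    """
    n = len(handlers)
    # queues[i] feeds stage i; queues[n] collects the final output.
    queues = [deque(packets)] + [deque() for _ in range(n)]
    cycles = 0
    while any(queues[:-1]):
        # Visit stages from last to first so that a packet advances
        # exactly one stage per cycle, as in a hardware pipeline.
        for i in reversed(range(n)):
            if queues[i]:
                out = handlers[i](queues[i].popleft())
                if out is not None:
                    queues[i + 1].append(out)
        cycles += 1
    return list(queues[-1]), cycles


# Two illustrative stages: increment, then double.
outputs, cycles = run_pipeline([lambda p: p + 1, lambda p: p * 2], [1, 2, 3])
print(outputs, cycles)
```

In this model three packets pass through two stages in four cycles rather than six, because both stages are busy at once after the first cycle.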




### **ClickNP element graph**



## **Exploit** parallelism inside element

- Minimize memory dependency
  - **Use registers** for fast access
  - **Delayed write** to resolve read-write dependency
  - **Memory scattering** to remove pseudo dependency
- Balance pipeline stages
  - **Unroll loops** whenever possible
  - Move slow branches to a separate element
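
The delayed-write idea can be illustrated with a small behavioral model. This is a software sketch under stated assumptions, not the hardware implementation: it assumes a one-entry register that holds the most recent write and forwards it to a following read of the same address, so a read-modify-write loop never waits on memory for its own last result. `DelayedWriteMemory` and `count_keys` are hypothetical names.

```python
class DelayedWriteMemory:
    """Behavioral model of a memory with a one-entry delayed-write register."""

    def __init__(self, size):
        self.mem = [0] * size   # stands in for on-chip block memory
        self.pending = None     # (address, value) held in a register

    def read(self, addr):
        # Forward from the register when the pending write hits this
        # address; otherwise read the (slower) memory.
        if self.pending is not None and self.pending[0] == addr:
            return self.pending[1]
        return self.mem[addr]

    def write(self, addr, value):
        # Commit the older pending write, then buffer the new one.
        self.flush()
        self.pending = (addr, value)

    def flush(self):
        """Commit the buffered write, if any, to memory."""
        if self.pending is not None:
            p_addr, p_value = self.pending
            self.mem[p_addr] = p_value
            self.pending = None


def count_keys(keys, size):
    """Per-key counter: a typical read-modify-write dependency."""
    table = DelayedWriteMemory(size)
    for key in keys:
        table.write(key, table.read(key) + 1)
    table.flush()
    return table.mem


print(count_keys([1, 1, 2, 1], 4))
```

Back-to-back updates to the same key are served from the register, which is the case that would otherwise stall a pipelined read-modify-write.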

### **High-throughput and low-latency PCIe channel**



#### **5** applications built with ~100 elements

**Packet generator and capture**: line rate at any packet size

#### **Design goals**

- **Flexibility**: fully programmed using high-level languages
- **Modularity**: modular architecture for packet processing
- **High performance and low latency**: 40 Gbps line rate, 2 microsecond latency for any packet size
- **Joint CPU/FPGA packet processing**: efficient split of work between CPU and FPGA

# **ClickNP system architecture**


- **OpenFlow firewall**: 1.23 µs latency (similar to an ASIC switch, 50x lower than CPU), 56.4 Mpps line rate (3x CPU)
- **IPsec gateway**: 46~200x higher throughput and 400x lower latency than CPU
- **L4 load balancer**: 32M concurrent flows, 10M new flows/sec
- **pFabric flow scheduler**: 4G strict flow priorities



ClickNP is the first FPGA-accelerated platform for network functions that is written entirely in a high-level language and achieves 40 Gbps line rate at any packet size.