Design Challenges of Manycore based SmartNIC & DPU

Author: Cheng Zhang

Date: 2025-06-17





## Why Do We Need SmartNIC & DPU?





The contradiction between the rapid increase in data volume and the slowing down of Moore's Law

Cited from MIT 6.829 Slides

# **Different Types of NICs**

With tunneling protocols such as VXLAN and virtual switching technologies such as OpenFlow and OVS, the complexity of network processing is gradually increasing and consumes more CPU resources.



#### **Traditional NIC**

Simple Function Implementation

- Fixed-function NIC hardware
- **Basic** function implementation
- Simple feature offload (checksum, LRO/LSO)

#### **SmartNIC**

Data-Plane Offload

- Match-Action Table (OVS dataplane fastpath)
- Rich function implementation: EC/Encrypt/Compress/...
- More feature/protocol offload
  - RoCE/RoCEv2, TCP...
  - IPSec/xxxSec
  - VirtIO-NET/BLK/SCSI/...
  - NVMe/NoF/...

#### **DPU**

Data-Plane + Control-Plane Offload

How to achieve "zero consumption" of host CPUs has become the next research direction for cloud vendors.

#### **Mainstream Products: Architecture Classification**

| Data-plane Architecture | ASIC                             | FPGA                           | NPU-Pipeline                   | Many Core                         | ASIC+NPU                |
|-------------------------|----------------------------------|--------------------------------|--------------------------------|-----------------------------------|-------------------------|
| SmartNIC                | NV ConnectX-7                    | NetFPGA                        |                                | Netronome CX/LX<br>Marvell OCTEON | Huawei<br>NV ConnectX-8 |
| DPU                     | NV BlueField1/2<br>BRCM Stingray | Intel FPGA IPU<br>Xilinx Alveo | AMD Pensando<br>Intel ASIC IPU | Huawei<br>Netronome FX            | NV BlueField3           |

A manycore-based architecture improves *flexibility* and adapts to a wide range of application scenarios.

### **A Reference Architecture of Manycore based DPU**

|                              | Reference Architecture                                                                                                              |
|------------------------------|-------------------------------------------------------------------------------------------------------------------------------------|
| Target Scenario              | <ul> <li>Storage</li> <li>Smart Compute</li> <li>Cloud Compute</li> </ul>                                                           |
| Architectural<br>Description | <ul> <li>Data Plane: programmable core cluster, without strong cache coherence</li> <li>Control Plane: General CPU Core</li> </ul> |
| Architecture                 | Figure: CPU cluster (multiple CPUs), accelerator, and datapath                                                                      |

### **Challenge 1: Shared Context Processing on Manycore**



**Competition & Conflict** 

Multiple cores/threads that process different packets of the same flow access the same context, so they compete for it.
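The conflict can be illustrated with a minimal sketch in which threads stand in for cores. The `FlowContext` fields, the 5-tuple flow key, and the single per-flow lock are assumptions made for illustration; they are not taken from any particular product.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class FlowContext:
    """Hypothetical per-flow state; the field names are illustrative."""
    packets: int = 0
    total_bytes: int = 0
    lock: threading.Lock = field(default_factory=threading.Lock)


flow_table = {}                 # flow key (5-tuple) -> FlowContext
table_lock = threading.Lock()   # guards insertion into flow_table


def lookup(flow_key):
    """Return the context of a flow, creating it on first use."""
    with table_lock:
        return flow_table.setdefault(flow_key, FlowContext())


def process_packet(flow_key, length):
    ctx = lookup(flow_key)
    # Every packet of the flow serializes here: this is the conflict point.
    with ctx.lock:
        ctx.packets += 1
        ctx.total_bytes += length


def run(num_cores=4, packets_per_core=1000, length=64):
    """Let several threads ("cores") process packets of one flow."""
    flow_table.clear()
    flow_key = ("10.0.0.1", "10.0.0.2", 1234, 80, "tcp")

    def core():
        for _ in range(packets_per_core):
            process_packet(flow_key, length)

    threads = [threading.Thread(target=core) for _ in range(num_cores)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return flow_table[flow_key]
```

With one lock per flow the counters stay consistent, but all cores working on the same flow take turns, so adding cores does not add parallelism for that flow.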






#### **Common Ideas: Using multiple small locks**

So that multiple cores/threads can process different packets of the same flow in parallel.



- Observation 1: You need finer-grained locks.
  - The maximum parallelism for a single flow is limited by the number of locks.
- Observation 2: You need more time to operate the locks.
  - High latency, because packets wait a long time in lock queues due to empty bubbles.
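Both observations can be seen in a small sketch that splits one flow context into fields, each guarded by its own small lock. The class name, the field names, and the lock-operation counter are hypothetical and only serve the illustration.

```python
import threading


class FineGrainedContext:
    """Hypothetical flow context: one small lock per field."""

    def __init__(self, field_names):
        self.values = {name: 0 for name in field_names}
        self.locks = {name: threading.Lock() for name in field_names}

    def max_parallelism(self):
        # At most one core can hold each lock, so the number of locks
        # bounds how many cores can work on this flow at the same time.
        return len(self.locks)

    def process_packet(self, updates):
        """Apply {field: delta} updates; return the number of lock operations."""
        lock_ops = 0
        for name, delta in updates.items():
            with self.locks[name]:
                self.values[name] += delta
            lock_ops += 1
        return lock_ops


def run(packets_per_core=1000):
    ctx = FineGrainedContext(["rx_bytes", "rx_packets", "last_seq"])
    workloads = [
        {"rx_bytes": 64},                   # core 0 touches one field
        {"rx_packets": 1},                  # core 1 touches another field
        {"rx_bytes": 64, "rx_packets": 1},  # core 2 needs two locks per packet
    ]

    def core(updates):
        for _ in range(packets_per_core):
            ctx.process_packet(updates)

    threads = [threading.Thread(target=core, args=(w,)) for w in workloads]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return ctx
```

Cores that touch different fields no longer block each other, but a packet that touches several fields pays for one acquire/release pair per field, which is the extra lock-operation time named in Observation 2.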

#### Possible Explorations:

- Domain specific locks.
- Closer coupling of locks and operations.

### **Challenge 2: One Architecture Meets All**

- **Observation 1**: Different stateful applications are sensitive to different metrics.
  - Some applications are more sensitive to throughput, while others are more sensitive to latency.
- **Observation 2**: The RTC (run-to-completion) and pipeline processing models suit different metrics.
  - The pipeline model is oriented toward high throughput, while the RTC model targets low completion latency.
- **Possible Exploration**: Employ a flexible NoC and flexible cores to meet different metrics.

| Application<br>Scenario   | Low<br>Latency | High<br>Throughput | Flexibility | Large<br>Flow Number |
|---------------------------|----------------|--------------------|-------------|----------------------|
| HPC                       | 1              | 0                  | 0           | 1                    |
| Distributed<br>Database   | 1              | 0                  | 0           | 0                    |
| Distributed<br>Storage    | 1              | 0                  | 1           | 1                    |
| Virtualization<br>/ Cloud | 0              | 1                  | 1           | 1                    |
| AI LLM                    | 0              | 1                  | 0           | 1                    |

Figure: core clusters configured as a pipeline cluster or as an RTC cluster.
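The structural difference between the two processing models can be sketched as follows. The three stages (`parse`, `classify`, `decide`) and their logic are made up for illustration, and threads with queues stand in for cores and the links between them; the sketch shows only how work is organized and makes no performance claim.

```python
import queue
import threading


def parse(pkt):
    return {"raw": pkt, "length": len(pkt)}


def classify(meta):
    meta["flow_id"] = sum(meta["raw"][:4]) % 8
    return meta


def decide(meta):
    meta["action"] = "forward" if meta["length"] <= 1500 else "drop"
    return meta


STAGES = [parse, classify, decide]


def run_to_completion(packets):
    """RTC: one core runs every stage on a packet before taking the next."""
    results = []
    for pkt in packets:
        item = pkt
        for stage in STAGES:
            item = stage(item)
        results.append(item)
    return results


def run_pipeline(packets):
    """Pipeline: one thread ("core") per stage, FIFO queues in between."""
    links = [queue.Queue() for _ in range(len(STAGES) + 1)]

    def stage_worker(stage, inbox, outbox):
        while True:
            item = inbox.get()
            if item is None:          # end-of-stream marker
                outbox.put(None)
                return
            outbox.put(stage(item))

    threads = [
        threading.Thread(target=stage_worker, args=(s, links[i], links[i + 1]))
        for i, s in enumerate(STAGES)
    ]
    for t in threads:
        t.start()
    for pkt in packets:
        links[0].put(pkt)
    links[0].put(None)

    results = []
    while True:
        item = links[-1].get()
        if item is None:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results
```

Both functions produce the same per-packet results. In the pipeline version each packet crosses a queue between stages, which is where the hand-off cost of that model comes from, while in the RTC version a packet never leaves its core.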


#### Observation 1: How to mitigate context synchronization?

- Context access/passing accounts for a non-negligible share of processing latency.
- Some contexts exhibit locality: they are shared among a limited set of physical cores.

#### Possible Exploration:

- Faster and reconfigurable context-swapping mechanisms.

#### Observation 2: How to flexibly and adaptively schedule cores?

- Core scheduler: allocates cores according to the concurrent latency-sensitive and bandwidth-sensitive workloads.
- Packet dispatcher: distributes packets to the appropriate cores for application processing.

#### Possible Exploration:

- Adaptive core scheduler and flexible packet dispatcher, like FlexPath NP.
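The division of labor between the two components might look like the sketch below. It is a toy model under stated assumptions, not the FlexPath NP design: the workload class names, the proportional allocation policy, and the CRC32 flow hash are all chosen for illustration.

```python
import zlib


def allocate_cores(total_cores, demand):
    """Core scheduler: split cores between workload classes.

    `demand` maps a class name to a positive load estimate. Every class
    gets one core, and the remaining cores are shared proportionally.
    """
    classes = list(demand)
    total_demand = sum(demand.values())
    if total_cores < len(classes) or total_demand <= 0:
        raise ValueError("need one core per class and a positive demand")

    spare = total_cores - len(classes)
    shares = {c: spare * demand[c] / total_demand for c in classes}
    alloc = {c: 1 + int(shares[c]) for c in classes}

    # Hand out cores lost to rounding, largest fractional share first.
    leftover = total_cores - sum(alloc.values())
    by_fraction = sorted(classes, key=lambda c: shares[c] - int(shares[c]),
                         reverse=True)
    for c in by_fraction[:leftover]:
        alloc[c] += 1

    pools, next_id = {}, 0
    for c in classes:
        pools[c] = list(range(next_id, next_id + alloc[c]))
        next_id += alloc[c]
    return pools


def dispatch(flow_key, workload_class, pools):
    """Packet dispatcher: pick a core inside the pool of the packet's class.

    Hashing the flow key keeps all packets of a flow on the same core.
    """
    pool = pools[workload_class]
    return pool[zlib.crc32(flow_key) % len(pool)]
```

For example, `allocate_cores(8, {"latency": 1, "throughput": 2})` gives the latency class cores 0 to 2 and the throughput class cores 3 to 7. Calling `allocate_cores` again with updated demand figures is the adaptive part; packets dispatched afterwards follow the new pools.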

Figure: Regional cache coherence (CC). CPU cores with L1 caches are grouped into Coherency Region 1 and Coherency Region 2 and share memory.



- Ohlendorf, R., Meitinger, M., Wild, T., & Herkersdorf, A. (2010). FlexPath NP - Flexible, Dynamically Reconfigurable Processing Paths in Network Processors. Dynamically Reconfigurable Systems: Architectures, Design Methods and Applications, 355-374.
- Srivatsa, A., Rheindt, S., Wild, T., & Herkersdorf, A. (2017). Region based cache coherence for tiled MPSoCs. 2017 30th IEEE International System-on-Chip Conference (SOCC), Munich, Germany, 286-291.

### **Challenge 3: Achieving Ultra High Bandwidth**

**Observation 1:** Non-linear scaling problem for manycore processing.

- How can a future NIC achieve 1.6 Tbps+ processing performance with a manycore architecture?
- Supporting line-rate stateful applications is the most challenging part.
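A back-of-the-envelope calculation shows how tight the budget is. The sketch assumes minimum-size 64-byte Ethernet frames plus 20 bytes of preamble, start-of-frame delimiter, and inter-frame gap; the core count and clock frequency are hypothetical. Amdahl's law is used here as a stand-in model for non-linear scaling, where the serial fraction represents work that cannot be parallelized, such as updates to a shared context.

```python
def packets_per_second(line_rate_bps, frame_bytes=64, overhead_bytes=20):
    """Packet rate at line rate; overhead = preamble + SFD + inter-frame gap."""
    return line_rate_bps / ((frame_bytes + overhead_bytes) * 8)


def cycles_per_packet(line_rate_bps, cores, core_hz, frame_bytes=64):
    """Cycle budget each packet gets if all cores share the load perfectly."""
    return cores * core_hz / packets_per_second(line_rate_bps, frame_bytes)


def amdahl_speedup(cores, serial_fraction):
    """Amdahl's law: speedup over one core for a given serial fraction."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


if __name__ == "__main__":
    rate = 1.6e12  # 1.6 Tbps
    print(f"{packets_per_second(rate) / 1e9:.2f} Gpps")
    print(f"{cycles_per_packet(rate, cores=256, core_hz=2e9):.2f} cycles/packet")
    print(f"{amdahl_speedup(256, 0.05):.1f}x speedup with 5% serial work")
```

Under these assumptions, 1.6 Tbps corresponds to about 2.38 billion packets per second, and 256 cores at 2 GHz leave roughly 215 cycles per packet even with perfect load sharing. If 5% of the per-packet work were serialized, the same 256 cores would behave like fewer than 19.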

**Possible Exploration:** Should it still be defined as a traditional NIC?

- Merging high-bandwidth Eth/PCIe ToR switching with the NIC.



From 400 Gbps, 800 Gbps, and 1.6 Tbps to 3.2+ Tbps

### **Challenge 4: Architectural Evolution for Future**

**Observation 1:** How can the architecture evolve to enhance low-latency and high-bandwidth capabilities without losing its existing programmability?

Recent commercial DPU products employ similar approaches (e.g., Azure employs SmartNIC pooling for NF offloading [NSDI '23]).

# Q & A