

#### Reexamining Direct Cache Access to Optimize I/O Intensive Applications for Multi-hundred-gigabit Networks

<u>Alireza Farshin</u><sup>\*</sup>, Amir Roozbeh<sup>\*+</sup>, Gerald Q. Maguire Jr.<sup>\*</sup>, Dejan Kostić

\* KTH Royal Institute of Technology, School of Electrical Engineering and Computer Science (EECS)







- 1. I/O device DMAs\* packets to main memory
- 2. CPU later fetches them to cache










- Large number of accesses to main memory
- High access latency (>60 ns)
- Unnecessary memory bandwidth usage





## **Direct Cache Access (DCA)**

- 1. I/O device DMAs packets to main memory
- 2. DCA exploits TPH\* to prefetch a portion of packets into cache
- 3. CPU later fetches them from cache








## Intel Data Direct I/O (DDIO)

- Available in Intel Xeon processors since the Xeon E5 family
- DMAs packets or descriptors directly to/from the Last Level Cache (LLC)





#### More in-network computing + offloading capabilities

Push costly calculations into the network and perform **stateful** functions at the processor, which makes applications more I/O intensive.



Every 6.72 ns a new (64-B+20-B\*) packet arrives at 100 Gbps
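The 6.72 ns figure follows from simple arithmetic, sketched below. The sketch assumes the extra 20 B is the usual per-frame Ethernet overhead on the wire (preamble, start-of-frame delimiter, and inter-frame gap); the function name `inter_arrival_ns` is made up for this illustration.

```python
def inter_arrival_ns(frame_bytes: int, overhead_bytes: int, link_gbps: float) -> float:
    """Time between back-to-back packets on a saturated link, in nanoseconds."""
    bits_on_wire = (frame_bytes + overhead_bytes) * 8
    # 1 Gbps equals 1 bit per nanosecond, so bits / Gbps yields nanoseconds.
    return bits_on_wire / link_gbps


# Minimum-size frame (64 B) plus the assumed 20 B of wire overhead at 100 Gbps.
min_frame_gap = inter_arrival_ns(64, 20, 100)
print(f"{min_frame_gap:.2f} ns")  # prints "6.72 ns"
```

At 200 Gbps the same calculation halves the budget to 3.36 ns per minimum-size packet.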





Without DCA we are unable to process I/O at line rate, thus *increasing* packet loss or latency when utilizing multi-hundred-gigabit networks.



<sup>2020-07-02</sup> \* A PCIe 3.0 x16 slot is capable of providing ~125 Gbps effective full-duplex bandwidth.
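As a rough cross-check of the ~125 Gbps figure, the sketch below computes the per-direction bandwidth of a PCIe 3.0 x16 link from its signalling rate (8 GT/s per lane) and its 128b/130b line encoding. It is only an upper bound that ignores packet-level protocol overhead, and the helper name `pcie_encoded_gbps` is hypothetical.

```python
def pcie_encoded_gbps(gt_per_s: float, lanes: int,
                      payload_bits: int, encoded_bits: int) -> float:
    """Per-direction bandwidth after line encoding, in Gbps."""
    return gt_per_s * lanes * payload_bits / encoded_bits


# PCIe 3.0: 8 GT/s per lane, 16 lanes, 128b/130b encoding.
pcie3_x16 = pcie_encoded_gbps(8.0, 16, 128, 130)
print(f"{pcie3_x16:.1f} Gbps per direction")
```

The result lands slightly above 125 Gbps, which is consistent with the footnote once some protocol overhead is subtracted.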



#### Writing packets/descriptors:

DDIO overwrites a cache line **if** it is already present in *any* LLC way (≡ write update or hit)

Otherwise, DDIO allocates a cache line in a limited portion of the LLC (≡ write allocate or miss)

#### Reading packets/descriptors:

NIC reads a cache line from the LLC if it is already present in *any* LLC way (≡ read hit)

Otherwise, NIC reads it from main memory (≡ read miss)
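The write and read rules above can be captured in a toy model. This is a minimal sketch, not Intel's implementation: the class `DdioLlcModel`, its method names, and the oldest-first replacement inside the DDIO portion are all assumptions made for illustration.

```python
from collections import OrderedDict


class DdioLlcModel:
    """Toy LLC: a line sits either in the DDIO portion or elsewhere in the LLC."""

    def __init__(self, ddio_lines: int):
        self.ddio_lines = ddio_lines  # capacity of the limited DDIO portion
        self.ddio = OrderedDict()     # lines allocated by DDIO, oldest first
        self.other = set()            # lines brought into other ways by cores

    def core_fill(self, addr: int) -> None:
        """A core loads a line into a non-DDIO way."""
        self.other.add(addr)

    def nic_write(self, addr: int) -> str:
        if addr in self.ddio:
            self.ddio.move_to_end(addr)  # refresh recency (assumed policy)
            return "write update"
        if addr in self.other:
            return "write update"        # a hit in any way is updated in place
        if len(self.ddio) >= self.ddio_lines:
            self.ddio.popitem(last=False)  # evict the oldest DDIO line
        self.ddio[addr] = True
        return "write allocate"

    def nic_read(self, addr: int) -> str:
        present = addr in self.ddio or addr in self.other
        return "read hit" if present else "read miss"


llc = DdioLlcModel(ddio_lines=2)
events = [llc.nic_write(0), llc.nic_write(1), llc.nic_write(0), llc.nic_write(2)]
print(events, llc.nic_read(1), llc.nic_read(2))
```

In this run the fourth write allocates a line and pushes line 1 out of the DDIO portion, so the later read of line 1 has to go to main memory.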





We designed a set of micro-benchmarks to learn about DDIO:

- Which ways are used for allocation?
- How does DDIO interact with other applications?
- Does DMA via a remote CPU socket pollute LLC?























## LLC ways used by DDIO



\* Cache Allocation Technology




















## How does DDIO perform?

DDIO *cannot* provide expected benefits!

- ResQ\* [NSDI'18]
- Intel reports

Write-allocate DDIO could evict both *not-yet-processed* and *already-processed* packets from the LLC

Evicted packets then have to be read from main memory rather than from the LLC

Reduce the number of RX descriptors so that the buffers fit in the limited DDIO portion.



#### Reducing #Descriptors is Not Sufficient! (1/2)





#### Reducing #Descriptors is Not Sufficient! (2/2)



Forwarding 1500-B packets at 100 Gbps with 256 per-core RX descriptors:

$1500\text{ B} \times 256 \times 18 \approx 6.59\text{ MB} \gg 4.5\text{ MB}$

DDIO should be able to perform well with a high number of RX descriptors!
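The buffer-footprint arithmetic on this slide can be reproduced with a few lines. The sketch assumes that the factor 18 is the number of cores, that every descriptor points to one full 1500-B buffer (cache-line rounding is ignored), and that the sizes are in binary megabytes; the function names are hypothetical.

```python
def rx_buffer_footprint_bytes(packet_bytes: int, descriptors_per_core: int,
                              cores: int) -> int:
    """Total bytes occupied by RX buffers when every descriptor is in use."""
    return packet_bytes * descriptors_per_core * cores


def max_descriptors_per_core(ddio_bytes: float, packet_bytes: int, cores: int) -> int:
    """Largest per-core descriptor count whose buffers still fit in the DDIO portion."""
    return int(ddio_bytes // (packet_bytes * cores))


MIB = 2 ** 20
footprint = rx_buffer_footprint_bytes(1500, 256, 18)
fit = max_descriptors_per_core(4.5 * MIB, 1500, 18)
print(f"{footprint / MIB:.2f} MB needed, {fit} descriptors per core would fit")
```

Under these assumptions the buffers need roughly 6.59 MB, and only about 174 descriptors per core would fit in a 4.5 MB DDIO portion, well below the 256 used in the experiment.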



#### Tuning a little-discussed register can improve the performance of DDIO

Default value is 0x600
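To make the default value concrete, the sketch below decodes 0x600 as a bitmask. It assumes that each set bit enables one LLC way for DDIO allocation, and it uses an 11-way, 24.75 MB LLC purely as an illustrative configuration because that reproduces the 4.5 MB DDIO portion quoted earlier; neither the geometry nor the helper names come from the slides, and the code does not read or write the actual register.

```python
def ddio_ways(mask: int) -> list[int]:
    """Bit positions set in the mask, interpreted as LLC way indices."""
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


def ddio_capacity_mb(mask: int, llc_mb: float, llc_ways: int) -> float:
    """Share of the LLC available to DDIO if each set bit enables one way."""
    return llc_mb * len(ddio_ways(mask)) / llc_ways


default_mask = 0x600
print(ddio_ways(default_mask))                    # [9, 10]
print(ddio_capacity_mb(default_mask, 24.75, 11))  # 4.5
```

Setting more bits in the mask would, under the same assumption, enlarge the DDIO portion one way at a time.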







## Impact of Tuning DDIO

How much DDIO's effect on hit rates changes application-level performance depends on the application's characteristics

For example, an I/O intensive application: 2 cores forwarding 1500-B packets at 100 Gbps






# Is Tuning DDIO Enough?

Tuning is **not** a perfect solution, because of:

- The cache also being used for code/data,
- Smaller per-core cache quotas, and
- Coarse-grained partitions.

Next generation DCA should provide:

- **Fine-grained placement**: Similar to CacheDirector\* [EuroSys'19]
- **I/O isolation**: Extend CAT<sup>+</sup> and CDP<sup>++</sup> to include I/O
- **Selective DCA/DMA**: Only transfer relevant parts of the packet to LLC

<sup>+</sup> Cache Allocation Technology, <sup>++</sup> Code/Data Prioritization



DMA should **not** be directed to the cache if this would cause I/O evictions!

**Bypassing the cache** is beneficial in multi-tenant/multi-application environments, where some performance isolation is desired.

- Disabling DDIO for a specific PCIe port
- Exploiting a remote socket



#### Using Our Knowledge for 200 Gbps



[Figure: device under test forwarding packets]



Tuning DDIO improves packet processing at 200 Gbps



[Figure: latency of the first NIC versus aggregate rate]

Better cache management is necessary for multi-hundred-gigabit-per-second networks



We study the performance of DDIO in different scenarios. See our paper for more results about:

- How does the receiving rate affect DDIO performance?
- How does processing time affect DDIO performance?
- Is DDIO always beneficial?
- Scaling up and DDIO.

# Our Key Findings (1/2)

- If an application is I/O bound, adding excessive cores could degrade its performance.
- If an application is I/O bound, tuning a little-discussed register called **IIO LLC WAYS** could improve performance and lead to the same improvements as adding more cores.
- If an application starts to become CPU bound, adding more cores could improve its throughput, but it is important to balance load among cores to maximize DDIO's benefits.
- Approaching 100 Gbps can cause DDIO to become a bottleneck. Therefore, it is essential to know when to bypass the cache to realize performance isolation.



- If an application is truly CPU/memory bound, tuning DDIO is less effective.

We now explain the impact of processing time on the performance of DDIO, which resulted in this finding.





[Figure: the device under test swaps each input packet's MAC addresses and calls a random number generator (std::mt19937) before sending the packet out. The plot shows DDIO read and write hit rates (%) and throughput against the number of calls (0 to 100).]

Increasing processing time improves DDIO performance
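The per-packet work in this experiment can be sketched as follows. This is an illustration only, not the benchmark's code: the function `process_packet` is hypothetical, and Python's `random.Random` (which is also a Mersenne Twister generator) stands in for the C++ generator named in the figure.

```python
import random


def process_packet(frame: bytes, n_calls: int, rng: random.Random) -> bytes:
    """Swap destination and source MAC, then burn time with n_calls RNG draws."""
    dst, src, rest = frame[0:6], frame[6:12], frame[12:]
    for _ in range(n_calls):
        rng.getrandbits(32)  # result is discarded; only the processing time matters
    return src + dst + rest


rng = random.Random(0)
frame = bytes(range(14)) + b"payload"
forwarded = process_packet(frame, n_calls=10, rng=rng)
print(forwarded[:12].hex())
```

Raising `n_calls` lengthens the time each packet spends on a core, which is the knob varied along the x-axis of the plot.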










DDIO performance **matters most** when an application is **I/O bound**, rather than CPU/memory bound.



- DCA/DDIO needs to be rearchitected for multi-hundred-gigabit networks.



https://github.com/aliireza/ddio-bench







Swedish Foundation for Strategic Research







# Thanks for listening

Do not hesitate to contact us if you have any questions.

farshin@kth.se and amirrsk@kth.se