


## Make the Most out of Last Level Cache in Intel Processors

### <u>Alireza Farshin</u><sup>\*</sup>, Amir Roozbeh<sup>\*+</sup>, Gerald Q. Maguire Jr.<sup>\*</sup>, Dejan Kostić<sup>\*</sup>

\* KTH Royal Institute of Technology (EECS/COM) + Ericsson Research































For a CPU running at 3.2 GHz, 4 cycles take about 1.25 ns.
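The conversion is plain arithmetic: time equals the cycle count divided by the clock frequency. A minimal sketch follows; the helper name `cycles_to_ns` is ours and used only for illustration.

```python
def cycles_to_ns(cycles: float, freq_ghz: float) -> float:
    """Convert a cycle count to nanoseconds at a given clock frequency.

    At 1 GHz one cycle lasts 1 ns, so one cycle lasts 1 / freq_ghz ns.
    """
    if freq_ghz <= 0:
        raise ValueError("frequency must be positive")
    return cycles / freq_ghz


if __name__ == "__main__":
    # 4 cycles at 3.2 GHz, the example used on the slide.
    print(f"{cycles_to_ns(4, 3.2):.2f} ns")
```

The same helper applies to any cycle count quoted in the talk, as long as the core frequency is fixed at the stated value.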













## **Better Cache Management**







## Non-uniform Cache Architecture (NUCA)



Since Sandy Bridge (~2011), the LLC is no longer unified!




#### Intel's Complex Addressing

Determines the mapping between memory address space and LLC Slices.

Almost every cache line (64 B) maps to a different LLC slice.
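To illustrate the idea of a hash that maps physical addresses to slices, here is a minimal sketch in which each bit of the slice index is the XOR (parity) of a selected set of address bits. The three masks below are illustrative placeholders chosen by us, not Intel's actual hash; the real function is undocumented and differs between processor models.

```python
CACHE_LINE = 64  # bytes, as stated on the slide

# HYPOTHETICAL masks: each one selects the address bits XOR-ed together
# to produce one bit of the slice index. All selected bits are >= bit 6,
# so every byte inside one cache line maps to the same slice.
PLACEHOLDER_MASKS = (0x1540, 0x2A80, 0x33C0)


def slice_of(phys_addr: int, masks=PLACEHOLDER_MASKS) -> int:
    """Return the slice index for a physical address under the toy hash."""
    slice_index = 0
    for bit, mask in enumerate(masks):
        parity = bin(phys_addr & mask).count("1") & 1
        slice_index |= parity << bit
    return slice_index


if __name__ == "__main__":
    # Consecutive cache lines tend to land in different slices.
    for line in range(8):
        addr = line * CACHE_LINE
        print(f"line {line} (addr {addr:#06x}) -> slice {slice_of(addr)}")
```

With three masks the toy hash distinguishes eight slices; a real system would need the mapping recovered for that specific CPU.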

#### Known Methods: Clémentine Maurice et al. [RAID '15]\*

Performance Counters

\* Clémentine Maurice, Nicolas Le Scouarnec, Christoph Neumann, Olivier Heen, and Aurélien Francillon. 2015. Reverse Engineering Intel Last-Level Cache Complex Addressing Using Performance Counters.

## Measuring Access Time to LLC Slices



Different access times to different LLC slices.



Measuring Read Access Time from Core 0 to all LLC slices





Accessing the **closer** LLC slice can save up to ~20 cycles, i.e., 6.25 ns for a CPU running at 3.2 GHz.



Allocate physical memory so that it maps to the appropriate LLC slice(s).
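One way to picture slice-aware allocation is as a filter over a physically contiguous region: walk it one cache line at a time and keep only the lines that map to the wanted slice. The sketch below assumes a caller-supplied `slice_of` function; `toy_slice_of` is a hypothetical stand-in (round-robin over 8 slices) and not the real hash.

```python
from typing import Callable, Iterator

CACHE_LINE = 64  # bytes


def lines_in_slice(base: int, size: int, target_slice: int,
                   slice_of: Callable[[int], int]) -> Iterator[int]:
    """Yield cache-line-aligned addresses in [base, base + size) that map
    to target_slice according to the supplied slice_of function."""
    # Round the start up to the next cache-line boundary.
    first = (base + CACHE_LINE - 1) // CACHE_LINE * CACHE_LINE
    # Only whole cache lines that fit inside the region are considered.
    for addr in range(first, base + size - CACHE_LINE + 1, CACHE_LINE):
        if slice_of(addr) == target_slice:
            yield addr


def toy_slice_of(addr: int) -> int:
    """HYPOTHETICAL mapping for demonstration only: round-robin by line."""
    return (addr // CACHE_LINE) % 8


if __name__ == "__main__":
    # Cache lines of a 1 KiB region that fall into slice 3 under the toy map.
    print(list(lines_in_slice(0, 1024, 3, toy_slice_of)))
```

Because only a fraction of the lines qualify, an allocator built this way trades usable capacity for placement control.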









### Use Cases:

- Isolation
- Shared Data
- Performance

Every core is associated with its closest LLC slice.
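Given per-core access-time measurements, the association can be derived by picking, for each core, the slice with the lowest latency. The sketch below uses a made-up latency table purely to show the selection step; the numbers are hypothetical and are not measurements from this work.

```python
from typing import Dict, List


def closest_slice(latency_cycles: List[List[float]]) -> Dict[int, int]:
    """Map each core to the slice with the lowest measured access time.

    latency_cycles[core][slice] holds the access time in cycles.
    """
    mapping = {}
    for core, row in enumerate(latency_cycles):
        mapping[core] = min(range(len(row)), key=row.__getitem__)
    return mapping


if __name__ == "__main__":
    # HYPOTHETICAL numbers: 2 cores x 4 slices, for illustration only.
    example = [
        [40, 48, 52, 60],  # core 0
        [50, 46, 42, 58],  # core 1
    ]
    print(closest_slice(example))
```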







### There are many applications that have this characteristic.

- Key-Value Stores
- Virtualized Network Functions: the packet's header can fit into a slice

We focus on virtualized network functions in this talk!



## **CacheDirector**

A network I/O solution that extends Data Direct I/O (DDIO) by employing slice-aware memory management.



## Data Direct I/O (DDIO)

Sending/receiving packets via DDIO: DMA\*-ing packets directly to the LLC rather than DRAM.

\* Direct Memory Access (DMA)

Packets go to random slices!











- Sends packet's header to the <u>appropriate</u> LLC slice.
- Implemented as a part of userspace NIC drivers in the Data Plane Development Kit (DPDK).
- Introduces dynamic headroom in DPDK data structures.
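The dynamic-headroom idea can be sketched as a search: shift the start of the packet data within its buffer, one cache line at a time, until the first 64 B (the header) land in the desired slice. This is an illustration under our own assumptions, not the DPDK implementation: `dynamic_headroom`, `max_headroom`, and `toy_slice_of` are hypothetical names, and the buffer address is assumed to be cache-line aligned.

```python
from typing import Callable, Optional

CACHE_LINE = 64  # bytes


def dynamic_headroom(buf_addr: int, target_slice: int,
                     slice_of: Callable[[int], int],
                     max_headroom: int = 1024) -> Optional[int]:
    """Return the smallest headroom (a multiple of 64 B) such that the
    packet data starting at buf_addr + headroom maps to target_slice.

    Returns None when no offset within max_headroom qualifies.
    """
    for headroom in range(0, max_headroom + 1, CACHE_LINE):
        if slice_of(buf_addr + headroom) == target_slice:
            return headroom
    return None


def toy_slice_of(addr: int) -> int:
    """HYPOTHETICAL mapping for demonstration only: round-robin by line."""
    return (addr // CACHE_LINE) % 8


if __name__ == "__main__":
    # Buffer starts at cache line 5; the header should land in slice 3.
    print(dynamic_headroom(5 * CACHE_LINE, 3, toy_slice_of))
```

A larger search bound raises the chance of finding a matching line but leaves less room in the buffer for the packet itself.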







Testbed: a Packet Generator (replaying an actual campus trace) connected to a Device under Test running VNFs.

Intel Xeon E5 2667 v3





\* Georgios P. Katsikas, Tom Barbette, Dejan Kostić, Rebecca Steinert, and Gerald Q. Maguire Jr. 2018. Metron: NFV Service Chains at the True Speed of the Underlying Hardware.



## **Evaluation at 100 Gbps**

### Stateful NFV Service Chain

Service chain components: Router, NAPT, Load Balancer. Achieved throughput: ~76 Gbps.

[Figure: CDF of latency (μs), CacheDirector + DDIO vs. Traditional DDIO. Improvement: 21.5% (≈ 119 μs).]

- Faster access to packet header
- Faster processing time per packet
- Reduce queueing time





\* Service Level Objective (SLO)



- More NFV results
- Slice-aware key-value store
- Portability of our solution on Skylake architecture
- Slice Isolation vs. Cache Allocation Technology (CAT)
- More ...





- Hidden opportunity that can decrease average access time to LLC by  ${\sim}20\%$
- Useful for other applications



https://github.com/aliireza/slice-aware

- Meet us at the poster session



This work is supported by WASP, SSF, and ERC.



# Backup



- Intel Xeon Gold 6134 (Skylake)
- Mesh architecture
- 8 cores and 18 slices
- Non-inclusive LLC
- Does not affect DDIO





- IPv4: 14 B (Ethernet) + 20 B (IPv4) + 20 B (TCP) = 54 B < 64 B
- IPv6: 14 B (Ethernet) + 40 B (IPv6) + 20 B (TCP) = 74 B > 64 B

Any 64 B of the packet can be placed in the appropriate slice.
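The header arithmetic can be checked with a few lines. The sketch assumes the standard fixed header sizes without options or extension headers (Ethernet 14 B, IPv4 20 B, IPv6 40 B, TCP 20 B); the helper names are ours.

```python
CACHE_LINE = 64  # bytes

# Fixed header sizes in bytes, without options or extension headers.
HEADER_BYTES = {"ethernet": 14, "ipv4": 20, "ipv6": 40, "tcp": 20}


def header_size(*layers: str) -> int:
    """Total size in bytes of the given protocol headers."""
    return sum(HEADER_BYTES[layer] for layer in layers)


def lines_needed(size: int) -> int:
    """Number of 64 B cache lines needed to hold `size` bytes."""
    return -(-size // CACHE_LINE)  # ceiling division


if __name__ == "__main__":
    for stack in (("ethernet", "ipv4", "tcp"), ("ethernet", "ipv6", "tcp")):
        size = header_size(*stack)
        print("+".join(stack), size, "B ->", lines_needed(size), "line(s)")
```

Under these assumptions a TCP/IPv4 header fits in a single cache line, while a TCP/IPv6 header spills into a second one, so only part of it can be steered to the chosen slice.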

# Limitations and Considerations

- Data larger than 64 B
- Using linked-list and scatter data
- Future H/W features:
  - Bigger chunks (e.g., 4k pages)
  - Programmable

- Slice Imbalance
  - Limits our application to a smaller portion of the LLC, but with faster access.



- NUCA
- Cache-aware Memory Management (e.g., Partitioning and Page Coloring)
- Extending CacheDirector for the whole packet
- Slice-aware Hypervisor

## Slice-aware Memory Management





## **Evaluation — Low Rate**



### Simple Forwarding Application

1000 Packets/s



