Carnegie Mellon University

### Jennifer Brana (University of Portland), Brian Schwedock (CMU), Yatin Manerkar (University of Michigan), Nathan Beckmann (CMU)

### Motivation

Rising cost of data movement

Move compute to where data resides

Cache-attached accelerators move accelerators to *within* the cache hierarchy



täkō [1] (above) is a representative system that augments each tile of a CMP with an engine and engine cache. This allows the engine to:

1. Accelerate key computations
2. Use low-latency & fine-grained communication with the processor

To fully benefit, the engine's cache must maintain coherence with the system

### Naïve Design



Core and engine must transfer data through the LLC

LLC banks are often far from cores

Wasteful & unnecessary data movement

Treat eL1D as an additional LLC sharer

eL1D uses baseline protocol

Results in excessive writebacks to the LLC



# Kobold: Simplified Cache Coherence for Cache-Attached Accelerators

## Kobold Design



Challenge: Verification of new coherence protocols can be extremely costly

**Insight:** restricting the complexity of the accelerator to *within* a tile allows the LLC protocol to remain unchanged



Add directory to L2: Mis-direction Filter (MDF) tracks the state of the eL1D

Intra-tile coherence is maintained using MDF & new intra-tile communication

Intra-tile locality enables fast, local communication
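The role of the MDF can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not Kobold's implementation: the class name, the dictionary-based storage, and the `l2_miss_target` routing helper are all hypothetical, and a real MDF would presumably be a fixed-size hardware structure rather than an unbounded map.

```python
class MisdirectionFilter:
    """Toy MDF: maps a line address to the state the eL1D holds it in."""

    def __init__(self):
        self._state = {}  # addr -> "M" or "S"; absent means "I"

    def record(self, addr, state):
        """Update the tracked eL1D state for a line."""
        if state == "I":
            self._state.pop(addr, None)
        else:
            self._state[addr] = state

    def lookup(self, addr):
        """Return the tracked eL1D state, defaulting to invalid."""
        return self._state.get(addr, "I")


def l2_miss_target(mdf, addr):
    """Assumed routing rule: stay inside the tile if the eL1D has the line."""
    return "eL1D" if mdf.lookup(addr) != "I" else "LLC"


mdf = MisdirectionFilter()
mdf.record(0x40, "M")              # eL1D owns line 0x40
print(l2_miss_target(mdf, 0x40))   # request can be served within the tile
print(l2_miss_target(mdf, 0x80))   # untracked line goes to the LLC
```

The point of the sketch is only that a small per-tile directory lets the L2 decide locally whether the LLC needs to be involved.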

### Coherence Protocols

1) Tile caches can transfer ownership without sending requests to the LLC



| Step  | L1D | eL1D | L2 / MDF | L3 |
|-------|-----|------|----------|----|
| Init) | I   | M    | I / M    | M  |
| 1)    | I   | M    | I / M    | M  |
| 2)    | I   | M    | I / M    | M  |
| 3)    | I   | I    | I / M    | M  |
| 4)    | I   | I    | M / I    | M  |
| 5)    | M   | I    | M / I    | M  |
| 6)    | M   | I    | M / I    | M  |
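The table above can be replayed as data to check the property this scenario claims: the L3 (LLC) state never changes while ownership moves from the eL1D to the L1D. This is a minimal sketch. The tuple layout and function names are illustrative assumptions, the "L2 / MDF" column is read as L2 state followed by MDF state, and the single-writer check between the two L1 caches is a standard coherence invariant assumed here rather than one stated on the poster.

```python
# Per-step states copied from the table: (L1D, eL1D, L2, MDF, L3).
OWNERSHIP_TRANSFER = [
    ("I", "M", "I", "M", "M"),  # Init
    ("I", "M", "I", "M", "M"),  # 1
    ("I", "M", "I", "M", "M"),  # 2
    ("I", "I", "I", "M", "M"),  # 3
    ("I", "I", "M", "I", "M"),  # 4
    ("M", "I", "M", "I", "M"),  # 5
    ("M", "I", "M", "I", "M"),  # 6
]


def llc_state_unchanged(trace):
    """True if the L3 column holds the same state at every step."""
    return len({step[4] for step in trace}) == 1


def single_writer(trace):
    """True if L1D and eL1D are never both in M at the same step."""
    return all(not (l1d == "M" and el1d == "M") for l1d, el1d, *_ in trace)


print(llc_state_unchanged(OWNERSHIP_TRANSFER))
print(single_writer(OWNERSHIP_TRANSFER))
```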

2) All tile caches can share data that is tracked as exclusive in the LLC directory



| Step  | L1D | eL1D | L2 / MDF | L3 |
|-------|-----|------|----------|----|
| Init) | I   | M    | I / M    | M  |
| 1)    | I   | M    | I / M    | M  |
| 2)    | I   | M    | I / M    | M  |
| 3)    | I   | S    | I / M    | M  |
| 4)    | I   | S    | S / M    | M  |
| 5)    | S   | S    | S / M    | M  |
| 6)    | S   | S    | S / M    | M  |
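To make the sharing scenario concrete, the rows of the table can be encoded and queried for which tile caches hold the line in S while the L3 column stays in M. This is a sketch for illustration only: the tuple layout and the `tile_sharers` helper are assumptions, and the "L2 / MDF" column is read as L2 state followed by MDF state.

```python
# Per-step states copied from the table: (L1D, eL1D, L2, MDF, L3).
INTRA_TILE_SHARING = [
    ("I", "M", "I", "M", "M"),  # Init
    ("I", "M", "I", "M", "M"),  # 1
    ("I", "M", "I", "M", "M"),  # 2
    ("I", "S", "I", "M", "M"),  # 3
    ("I", "S", "S", "M", "M"),  # 4
    ("S", "S", "S", "M", "M"),  # 5
    ("S", "S", "S", "M", "M"),  # 6
]
CACHES = ("L1D", "eL1D", "L2")


def tile_sharers(step):
    """Names of the tile caches holding the line in S at this step."""
    return [name for name, state in zip(CACHES, step) if state == "S"]


final = INTRA_TILE_SHARING[-1]
print(tile_sharers(final))                        # sharers at step 6
print({step[4] for step in INTRA_TILE_SHARING})   # L3 states seen in the trace
```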

3) Caches coordinate responses to LLC requests so there is only one responder



| Step  | L1D | eL1D | L2 / MDF | L3 |
|-------|-----|------|----------|----|
| Init) | S   | S    | S / M    | M  |
| 1)    | S   | S    | S / M    | M  |
| 2)    | I   | I    | M / I    | M  |
| 3)    | I   | I    | M / I    | M  |
| 4)    | I   | I    | I / I    | M  |
| 5)    | I   | I    | I / I    | I  |
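One way to read the table above is that the L2 gathers the line from both L1 caches and then answers the LLC on behalf of the whole tile. The sketch below models that reading. It is an assumption-laden illustration: the handler, the `Tile` class, and the message name `InvAck+Data` are hypothetical, since the text here does not name the actual messages.

```python
from dataclasses import dataclass, field


@dataclass
class Tile:
    # Initial states follow the "Init)" row of the table.
    l1d: str = "S"
    el1d: str = "S"
    l2: str = "S"
    mdf: str = "M"
    to_llc: list = field(default_factory=list)  # messages leaving the tile


def handle_llc_invalidation(tile: Tile) -> None:
    """Hypothetical handler: the L2 collects the line, then responds alone."""
    # Rows 2-3: both L1 caches give up the line; neither contacts the LLC.
    tile.l1d, tile.el1d = "I", "I"
    tile.l2, tile.mdf = "M", "I"
    # Row 4: the L2 drops the line and sends the tile's only response.
    tile.l2 = "I"
    tile.to_llc.append("InvAck+Data")  # hypothetical message name


tile = Tile()
handle_llc_invalidation(tile)
print(len(tile.to_llc))  # number of responses the LLC receives from this tile
```

Because only the L2 appends to `to_llc`, the LLC sees one responder regardless of how many tile caches held the line.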

#### Kobold is not traditional hierarchical cache coherence (HCC)

Typical HCC adds inclusive intermediate caches to maintain cluster coherence, adding hierarchical indirection and increasing storage overhead

To avoid this, Kobold implements intra-tile coherence using the MDF and intra-tile communication, allowing the eL1D and L2 to maintain coherence between themselves and preserve baseline performance

Goal 3: prevent L2 cache pollution.

Generate fully concurrent protocols using the HieraGen toolset

Implement and test in the täkō system

### References

[1] B. Schwedock, et al. täkō: A Polymorphic Cache Hierarchy for General-Purpose Optimization of Data Movement. ISCA 2022.



MICRO 2022 Undergraduate SRC

### Performance Considerations

L2 cache is noninclusive of the eL1D.

**Optional eL1D optimization:** with a minor modification to the LLC protocol, we allow speculative eL1D loads

### Evaluation

We evaluate a system with a 128KB L2, 8KB eL1D, and 512KB LLC per tile

**Estimated MDF overhead of only 0.09% of baseline (L2+LLC) area**, using CACTI

Verified stable state protocols using the Murphi model checker



Stable state protocol for eL1D cache controller. Red arrows represent new intra-tile messages.

### Next Steps