



# MEMORY PREFETCHING EVALUATION USING GEM5 SIMULATIONS

NAM HO, CARLOS FALQUEZ, ANTONIO PORTERO, ESTELA SUAREZ

NOVEL SYSTEM ARCHITECTURE DESIGN, JÜLICH SUPERCOMPUTING CENTRE, FORSCHUNGSZENTRUM JÜLICH

**DIRK PLEITER** 

PDC CENTER FOR HIGH PERFORMANCE COMPUTING, KTH ROYAL INSTITUTE OF TECHNOLOGY

Copyright © European Processor Initiative 2023. ISC23-AHUG Workshop/N.Ho/Hamburg/May 25, 2023



### OUTLINE

- Motivation
- Hardware memory prefetching
- Simulation methodology
- Application selection and memory access pattern analysis
- Prefetch evaluation
- Conclusion & future study



# MOTIVATION

- Emerging HPC applications are demanding advanced computing systems
  - Need both high processing capability as well as high memory bandwidth
- Recent innovations in the HPC sector
  - Arm-based high-end processors with SVE technology have entered the HPC sector (e.g. Fugaku supercomputer)
  - New memory technologies (e.g. HBMx) make available high bandwidths
- The memory wall
  - Scaling memory bandwidth with processing capability remains a challenge
  - Memory prefetching is a well-known technique for hiding long-latency memory accesses: it predicts upcoming data accesses and preloads into the cache the data expected to be requested in the near future
- Work objectives
  - Investigate the effect of hardware prefetching techniques on scientific applications in recent Arm-based high-end processors
  - Leverage gem5 capabilities to develop a gem5-model of the latest Arm Neoverse V1 design with HBM2, based on the recently released Graviton3



# HARDWARE MEMORY PREFETCHING

- A well-known technique to eliminate processor stalls due to long-latency memory accesses
  - Reduces cache misses by predicting data accesses and fetching into the cache, ahead of time, the data expected to be requested in the near future
- Memory access pattern prediction
  - Stride (or delta) access pattern
    - Constant stride sequence: [A,A+k,A+2k,...]
    - Multi-delta sequence: [A,A+k,A+m,...B,B+k,B+m]
  - Address repetition pattern
    - [A,B,C,...,A,B,C]

- Prefetch effectiveness
  - Depends on prediction accuracy and prefetch coverage
  - Aggressive prefetching

\* Sparsh Mittal. 2016. A Survey of Recent Prefetching Techniques for Processor Caches. ACM Comput. Surv. 49, 2, Article 35 (Nov. 2016).
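
The access patterns listed above can be made concrete with a small sketch. The helper names (`constant_stride`, `multi_delta`, `predict_next`) are hypothetical and only illustrate the idea of delta-based prediction; they do not reproduce any prefetcher from the survey or from gem5.

```python
from typing import List, Optional


def constant_stride(a: int, k: int, n: int) -> List[int]:
    """Constant stride sequence [A, A+k, A+2k, ...] with n elements."""
    return [a + i * k for i in range(n)]


def multi_delta(bases: List[int], offsets: List[int]) -> List[int]:
    """Multi-delta sequence [A, A+k, A+m, ..., B, B+k, B+m]:
    the same offsets are replayed starting from each base address."""
    return [base + off for base in bases for off in [0] + offsets]


def deltas(addrs: List[int]) -> List[int]:
    """Differences between consecutive addresses."""
    return [y - x for x, y in zip(addrs, addrs[1:])]


def predict_next(addrs: List[int]) -> Optional[int]:
    """Toy stride predictor: predict only when the last two deltas agree."""
    d = deltas(addrs)
    if len(d) >= 2 and d[-1] == d[-2]:
        return addrs[-1] + d[-1]
    return None


stream = constant_stride(100, 4, 4)        # [100, 104, 108, 112]
mixed = multi_delta([0, 1000], [8, 24])    # [0, 8, 24, 1000, 1008, 1024]
print(predict_next(stream), predict_next(mixed))
```

A single-delta predictor locks onto the constant-stride stream but gives up on the multi-delta one, which is why delta-pattern schemes track several deltas at once.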



## SIMULATION METHODOLOGY: THE GEM5-MODEL OF NEOVERSE V1 ARCHITECTURE

- The 16-core Neoverse V1 architecture (ExpArch)
  - NoC:
    - Ruby Garnet + AMBA–CHI
    - 4x4 Mesh
    - 4 physical link channels for 4 VNETs
  - 16 cores + 16x1MB-SLCs
    - 2xSVE engines per core
    - Core's clock: 2.4GHz
    - System/NoC clock: 2.0GHz
  - Memory
    - HBM2 (8 channels x 38.4 GB/s per channel)



North-west quadrant of the modeled architecture: the left-most figure shows 16 cores, 16 SLC slices, and 8 HBM2 channels placed in a 2D mesh topology; the middle figure outlines a router connecting to 2 cores and 2 SLC slices; the right-most figure outlines the core's cache hierarchy.

Lilia Zaourar et al. "Multilevel simulation-based co-design of next generation HPC microprocessors." PMBS 2021: 18-29




# SIMULATION METHODOLOGY: QUANTITATIVE COMPARISON WITH REAL PLATFORMS

- To gain confidence in the gem5-model, we quantitatively compared the simulation results with N1SDP and Graviton3
  - Two additional gem5-models
    - Gem5-model (N1) for N1SDP,
    - Gem5-model (G3) for Graviton3

**Overview of three system architectures**

| Platform  | Core-Arch      | NoC      | Memory                      |
|-----------|----------------|----------|-----------------------------|
| Graviton3 | 64xNeoverse-V1 | AMBA-CHI | DDR5 (8x38.4GB/s/channel)   |
| N1SDP     | 4xNeoverse-N1  | AMBA-CHI | DDR4 (2x25.6GB/s/channel)   |
| ExpArch   | 16xNeoverse-V1 | AMBA-CHI | HBM2-Model                  |

**FMLA-bench results**

| Platform             | Work size | Loop | CPU clock (GHz) | Gflop/s | Flop/cycle (% of peak) | Cycles   | Retired instrs | SIMD instrs |
|----------------------|-----------|------|-----------------|---------|------------------------|----------|----------------|-------------|
| gem5-model (G3)      | 1000000   | 5    | 2.6             | 41.5    | 15.96 (99.7)           | 7.51E+07 | 1.8E+08        | 1.5E+08     |
| Graviton3            | 1000000   | 5    | 2.6             | 41.5    | 15.96 (99.8)           | 7.76E+07 | 1.8E+08        | 1.5E+08     |
| gem5-model (N1)      | 1000000   | 5    | 2.6             | 20.8    | 8 (100)                | 1.50E+07 | 3.30E+07       | 3.00E+07    |
| N1SDP                | 1000000   | 5    | 2.6             | 20.8    | 8 (100)                | 1.50E+07 | 3.30E+07       | 3.00E+07    |
| gem5-model (ExpArch) | 1000000   | 5    | 2.4             | 38.3    | 15.96 (99.7)           | 7.51E+07 | 1.8E+08        | 1.5E+08     |

### Simulation results match those on the real platforms

### Microbenchmarking

 FMLA-bench: A simple benchmark developed to test the maximum number of FP operations on an SVE machine
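
The relation between the Flop/cycle and Gflop/s figures reported for FMLA-bench can be checked with a short calculation. This is only a sketch: the 256-bit (Neoverse V1) and 128-bit (Neoverse N1) vector widths, double-precision elements, and two vector engines per core on both designs are assumptions not stated on the slide, and the function names are hypothetical.

```python
def peak_flop_per_cycle(engines: int, vector_bits: int, elem_bits: int = 64) -> int:
    """Peak flops per cycle when every engine retires one FMLA per cycle.
    An FMLA counts as two flops (multiply + add) per vector lane."""
    lanes = vector_bits // elem_bits
    return engines * lanes * 2


def gflops(flop_per_cycle: float, clock_ghz: float) -> float:
    """Sustained Gflop/s from measured flop/cycle and core clock in GHz."""
    return flop_per_cycle * clock_ghz


v1_peak = peak_flop_per_cycle(engines=2, vector_bits=256)  # assumed V1 configuration
n1_peak = peak_flop_per_cycle(engines=2, vector_bits=128)  # assumed N1 configuration

print(f"V1 peak: {v1_peak} flop/cycle, measured 15.96 -> {100 * 15.96 / v1_peak:.2f}% of peak")
print(f"G3 at 2.6 GHz: {gflops(15.96, 2.6):.1f} Gflop/s")
print(f"ExpArch at 2.4 GHz: {gflops(15.96, 2.4):.1f} Gflop/s")
print(f"N1 at 2.6 GHz: {gflops(n1_peak, 2.6):.1f} Gflop/s")
```

Under these assumptions the computed values line up with the Gflop/s column of the FMLA-bench results.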



### SIMULATION METHODOLOGY: QUANTITATIVE COMPARISON WITH REAL PLATFORMS (CONT.)

- To gain confidence in the gem5-model, we quantitatively compared the simulation results with N1SDP and Graviton3
- Microbenchmarking
  - Triad-kernel (from STREAM benchmark)

a[i] = S\*b[i] + c[i]
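
As a minimal sketch, the triad kernel and its bandwidth accounting can be written as follows. The byte count follows the STREAM convention of two reads and one write of 8-byte doubles per iteration; the array size and the helper names are illustrative assumptions. Timing a pure-Python loop measures the interpreter, not the memory system, so the printed figure is not comparable to the measurements on this slide.

```python
import time
from array import array


def triad(a: array, b: array, c: array, s: float) -> None:
    """STREAM triad: a[i] = s * b[i] + c[i]."""
    for i in range(len(a)):
        a[i] = s * b[i] + c[i]


def triad_bytes(n: int, elem_bytes: int = 8) -> int:
    """Bytes moved per triad pass: read b, read c, write a."""
    return 3 * elem_bytes * n


n = 100_000                      # illustrative size, far below a real STREAM run
a = array("d", [0.0]) * n
b = array("d", [1.0]) * n
c = array("d", [2.0]) * n

start = time.perf_counter()
triad(a, b, c, 3.0)
elapsed = max(time.perf_counter() - start, 1e-9)  # guard against a zero interval

print(f"{triad_bytes(n) / elapsed / 1e9:.3f} GB/s (interpreter-bound)")
```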

### N1SDP comparison with prefetchers disabled

| Platform                                     | L1D cache miss          | L2 cache miss            | SLC cache miss           | Retired instrs | L1D cache access | Bandwidth (GB/s) |
|----------------------------------------------|-------------------------|--------------------------|--------------------------|----------------|------------------|------------------|
| **Read-only kernel - With prefetchers disabled** |                     |                          |                          |                |                  |                  |
| gem5-model (N1)                              | 4.00E+07                | 5.03E+06                 | 5.03E+06                 | 1.60E+08       | 4.04E+07         | 1.83             |
| N1SDP                                        | 5.00E+06 <sup>(*)</sup> | 5.00E+06                 | 5.00E+06                 | 1.60E+08       | 4.00E+07         | 1.77             |
| **Triad kernel - With prefetchers disabled** |                         |                          |                          |                |                  |                  |
| gem5-model (N1)                              | 8.61E+07                | 1.51E+07                 | 1.51E+07                 | 2.80E+08       | 1.21E+08         | 2.95             |
| N1SDP                                        | 1.00E+07 <sup>(*)</sup> | 1.00E+07 <sup>(**)</sup> | 1.00E+07 <sup>(**)</sup> | 2.80E+08       | 1.20E+08         | 2.57             |
| gem5-model (N1) (Stride=4)                   | 2.50E+07                | 1.51E+07                 | 1.50E+07                 | 7.01E+07       | 3.10E+07         | 2.15             |
| N1SDP (Stride=4)                             | 1.50E+07                | 1.50E+07                 | 1.50E+07                 | 7.00E+07       | 3.00E+07         | 1.68             |

(\*) Differs by a factor of 8x due to cache-line refill counting on N1SDP.

(\*\*) Differs by a factor of 1.5x due to the effect of write-streaming mode on N1SDP.

### Three platforms comparison with prefetchers enabled

| Platform        | 1-thread (GB/s) | 4-threads (GB/s) | 16-threads (GB/s) |
|-----------------|-----------------|------------------|-------------------|
| gem5-model (N1) | 13.1            | 30.83            | -                 |
| N1SDP           | 19.72           | 29.29            | -                 |
| gem5-model (G3) | 39.11           | 115.75           | 173.66            |
| Graviton3       | 44.86           | 162.04           | 214.04            |
| ExpArch         | 38.01           | 117.16           | 169.04            |

Read-only kernel

sum += a[i]



# SELECTED HPC APPLICATION KERNELS & THE CHARACTERIZATION

| Kernel           | Description | Kernel code (pseudocode, via SVE optimization) | Access pattern |
|------------------|-------------|------------------------------------------------|----------------|
| TRIAD            | The triad kernel from the STREAM benchmark | // Triad<br>for (i = 0; i < N; i++)<br>a[i] <- S * b[i] + c[i] | Constant stride |
| SPMV<br>(MiniFE) | Matrix-vector product $y = y + Ax$, where x and y are dense vectors.<br>We evaluated the hand-optimized version using SVE intrinsics (*)<br>(*) Bine Brank et al. "Porting Applications to Arm-based Processors." In 2020 CLUSTER, 559–566. | <pre>// Sliced ELLPACK format<br>for (s = 0; s &lt; N; s += S)<br>  for (r = s; r &lt; s + S; r++)<br>    for (i = slice_start[s/S] + r - s;</pre> | Constant stride + random access with indirect indexing |
| waLBerla         | A framework for computational fluid dynamics simulations based on the LBM. We used the waLBerla use-case <i>UniformGridBenchmark</i> | // Stencil code<br>D3Q19 model implementation | Multi-stride |
| RTM              | Reverse-Time Migration (RTM) is a well-known method used in the reconstruction of geological subsurface structures.<br>We evaluated the isotropic wave equation (ISO) kernel | // Stencil code<br>8th order Laplacian Stencil,<br>a 3D 27-point stencil | Multi-stride |

| Benchmark | I<sub>mem</sub> (Byte)    | I<sub>fp</sub> (Flop) | AI (flops/byte) | Size            | Memory footprint (MiB) | L2 miss rate (%) | SLC miss rate (%) | L2 MPKI | SLC MPKI | IPC   | BW Util. (% peak) |
|-----------|---------------------------|-----------------------|-----------------|-----------------|------------------------|------------------|-------------------|---------|----------|-------|-------------------|
| waLBerla  | N·76·8                    | N·189                 | 0.31            | N=96×96×96      | 256                    | 97.53            | 87.57             | 46.88   | 41.05    | 10.93 | 31.71             |
| SpMV      | N<sub>nnz</sub>·(8+4)     | N<sub>nnz</sub>·2     | 0.17            | N=63×63×63      | 80                     | 91.3             | 92.82             | 58.1    | 53.92    | 8.1   | 23.6              |
| RTM       | 16.2                      | 34                    | 2.1             | N=128×256×256   | 128                    | 93.39            | 44.12             | 41.17   | 18.16    | 12.41 | 13.57             |
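
The arithmetic intensity (AI) column can be reproduced from the I<sub>mem</sub> and I<sub>fp</sub> columns as AI = I<sub>fp</sub> / I<sub>mem</sub>. The sketch below assumes that the waLBerla entry reads as 76 x 8 bytes and 189 flops per cell and that the SpMV entry reads as 8 + 4 bytes and 2 flops per non-zero; that reading of the notation is an assumption, and the dictionary layout is purely illustrative.

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """Arithmetic intensity in flops per byte."""
    return flops / bytes_moved


# (flops, bytes) per grid cell or per non-zero, as read from the table (assumed notation).
kernels = {
    "waLBerla": (189, 76 * 8),
    "SpMV": (2, 8 + 4),
    "RTM": (34, 16.2),
}

for name, (flops, bytes_moved) in kernels.items():
    print(f"{name}: AI = {arithmetic_intensity(flops, bytes_moved):.2f} flops/byte")
```

All three kernels sit well below the machine balance of a high-end core, which is consistent with their high L2 miss rates and their sensitivity to prefetching.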




# HPC APPLICATION KERNELS & CHARACTERIZATION (CONT.)



#### \* STRIDE pattern (Spatial Locality<sup>[1]</sup>):

- Computed by searching back over a window of the w most recent memory accesses: the stride s is the minimum distance between the current access address and its nearest neighbor among the addresses in the window.
- w = 64 (in units of 64-byte cache lines).
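
A minimal sketch of the stride metric described above, assuming that addresses are first reduced to 64-byte cache-line indices and that the window holds the w most recent lines. The function name `stride_histogram` is hypothetical, and a repeated line yields stride 0 under this reading.

```python
from collections import Counter, deque
from typing import Iterable

LINE_BYTES = 64  # cache-line size used as the address unit


def stride_histogram(byte_addrs: Iterable[int], w: int = 64) -> Counter:
    """For each access, record the minimum distance (in cache lines) to any
    of the previous w accesses, and return a histogram of those strides."""
    window: deque = deque(maxlen=w)
    hist: Counter = Counter()
    for addr in byte_addrs:
        line = addr // LINE_BYTES
        if window:  # the first access has no neighbor to compare against
            hist[min(abs(line - prev) for prev in window)] += 1
        window.append(line)
    return hist


# Unit-stride stream over consecutive cache lines: every stride is 1.
print(stride_histogram(i * 64 for i in range(10)))
# Accesses 256 bytes apart touch every fourth line: every stride is 4.
print(stride_histogram(i * 256 for i in range(5)))
```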



#### \* REUSE pattern (Temporal Locality<sup>[1]</sup>):

- Computed by searching back over a window of the L most recent memory accesses and counting how many times the current access address is repeated in it.
- L = 64 (in units of 64-byte cache lines).
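
The reuse metric can be sketched in the same way, again assuming cache-line granularity and a sliding window of the L most recent accesses. The function name `reuse_counts` is hypothetical.

```python
from collections import deque
from typing import Iterable, List

LINE_BYTES = 64  # cache-line size used as the address unit


def reuse_counts(byte_addrs: Iterable[int], window_len: int = 64) -> List[int]:
    """For each access, count how many of the previous window_len accesses
    touched the same cache line."""
    window: deque = deque(maxlen=window_len)
    counts: List[int] = []
    for addr in byte_addrs:
        line = addr // LINE_BYTES
        counts.append(sum(1 for prev in window if prev == line))
        window.append(line)
    return counts


# Line 0 is touched three times; the second and third touches see 1 and 2 repeats.
print(reuse_counts([0, 64, 0, 0]))
```

A streaming kernel produces all zeros here, while a stencil that revisits neighboring planes produces non-zero counts whenever the revisit distance fits in the window.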

[1] J. Weinberg, M. O. McCracken, E. Strohmaier and A. Snavely, "Quantifying Locality In The Memory Access Patterns of HPC Applications," SC '05, pp. 50-50, doi: 10.1109/SC.2005.59.



### HARDWARE PREFETCHERS USED FOR EVALUATION

**Hardware prefetcher configuration at L2 for evaluation**

| HW Prefetcher (*)   | Description             | Trigger on access | Cache snoop | Queue size | Specific parameters | Aggressive prefetching parameters |
|---------------------|-------------------------|-------------------|-------------|------------|---------------------|-----------------------------------|
| Tagged              | Next-line prefetching   |                   |             |            | -                   |                                   |
| Stride              | Stride detection        |                   |             |            | table entries = 128 |                                   |
| AMPM <sup>[1]</sup> | Delta-pattern detection | -                 | -           |            | hot zone size = 2KiB,<br>map table entries = 256 |      |
| SPP <sup>[2]</sup>  | Delta-pattern detection | True              | True        | 64         | prefetch threshold = 0.25,<br>lookahead threshold = 0.01,<br>signature table entries = 256,<br>pattern table entries = 4096 | Varying prefetch degree<br>[1..32] |

(\*) Prefetcher names are taken from gem5

[1] Ishii, Y. et al. "Access map pattern matching for high performance data cache prefetch." Journal of Instruction-Level Parallelism, 13, 1-24 (2011).

[2] Kim, J. et al. "Path confidence based lookahead prefetching." MICRO-49, Article 60, 12 (2016).
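
To illustrate what the "varying prefetch degree" parameter means for the simplest scheme, the sketch below generates the candidate addresses a next-line prefetcher would issue for a given degree. It is a hypothetical illustration with an assumed 64-byte line size, not the gem5 implementation.

```python
from typing import List


def next_line_candidates(miss_addr: int, degree: int, line_bytes: int = 64) -> List[int]:
    """Addresses of the `degree` cache lines that follow the line containing
    miss_addr. A higher degree means more aggressive prefetching."""
    base = miss_addr - (miss_addr % line_bytes)  # align to the start of the line
    return [base + d * line_bytes for d in range(1, degree + 1)]


print([hex(a) for a in next_line_candidates(0x1010, degree=2)])
print(len(next_line_candidates(0x1010, degree=32)))
```

With degree = 32 a single trigger requests 2 KiB of data beyond the current line, which helps streaming kernels but consumes bandwidth whenever the prediction is wrong.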



### **PREFETCH COUNTERS & METRICS**

- *Number of prefetches*: counts the prefetches actually sent to the next memory level.
- *Late prefetches*: counts the accurate prefetch requests that are still in the in-flight memory request buffer (TBE), waiting for the data to return, at the time a demand miss needs the data.
- *Timely prefetches*: counts the accurate prefetch requests that fetch the data needed by a demand miss in time.
- *Useless prefetches*: counts the inaccurate prefetches, whose fetched data are never used while in the cache.
- *Accuracy*: $\frac{\text{timely prefetches} + \text{late prefetches}}{\text{number of prefetches}}$
- *Coverage*: $\frac{\text{timely prefetches} + \text{late prefetches}}{\text{timely prefetches} + \text{late prefetches} + \text{demand misses}}$
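
The two metrics above can be computed directly from the counters. The sketch below uses a hypothetical `PrefetchCounters` container and made-up counter values for illustration only; the field names are not gem5 statistics names.

```python
from dataclasses import dataclass


@dataclass
class PrefetchCounters:
    issued: int         # number of prefetches sent to the next memory level
    timely: int         # accurate prefetches whose data arrived in time
    late: int           # accurate prefetches still in flight at the demand miss
    useless: int        # inaccurate prefetches whose data were never used
    demand_misses: int  # demand misses not covered by any prefetch

    def accuracy(self) -> float:
        """(timely + late) / number of prefetches."""
        return (self.timely + self.late) / self.issued if self.issued else 0.0

    def coverage(self) -> float:
        """(timely + late) / (timely + late + demand misses)."""
        useful = self.timely + self.late
        total = useful + self.demand_misses
        return useful / total if total else 0.0


# Illustrative values only, not measurements from this study.
counters = PrefetchCounters(issued=1000, timely=600, late=200, useless=200, demand_misses=200)
print(f"accuracy = {counters.accuracy():.2f}, coverage = {counters.coverage():.2f}")
```

Late prefetches count toward both metrics because the prediction was correct, even though only part of the miss latency is hidden.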



# **EFFECTS OF AGGRESSIVE PREFETCHING**





Triad kernel (STREAM benchmark)

- BW utilization saturated without prefetching (74% of peak; 16-thread evaluation)
- High aggressiveness shows performance improvement for single-thread



### SPP exploration with different threshold configurations



# EFFECTS OF AGGRESSIVE PREFETCHING (CONT.)

- waLBerla
  - Strongly benefits from aggressive prefetching
    - Conservative prefetching (e.g. degree = 1) does not work well (~50% late prefetches)
    - With degree = 32, all prefetchers show coverage of over 80%, and Next-line (TAG) delivers a speedup of up to 1.29×
  - More useless prefetches are observed with Next-line (TAG)





Figure legend: l2\_miss (D), slc\_access (P+D), slc\_miss (P+D), speedup, BW Util.

6-threads



# **PERFORMANCE IMPROVEMENT**

- All selected prefetchers are effective, showing speedups that correlate strongly with the classified prefetches
- Among the prefetchers, Next-line is the most effective, although it also issues the most useless prefetches
- SpMV gains the most, with a speedup of up to 2.45×

### 16-thread evaluation





### SENSITIVITY ANALYSIS-EFFECT OF MEMORY BANDWIDTH



Higher speedup gains when increasing the number of memory channels (benefit of aggressive prefetching)



### CONCLUSIONS

- We have evaluated the effects of memory prefetching on scientific applications running on high-end Arm processors with attached HBM memory, using a simulation methodology based on gem5
- Lessons learnt
  - Memory access patterns from the stencil-style and SpMV codes commonly used in today's scientific applications benefit from spatial address-correlation prefetching schemes
  - Aggressive prefetching evaluated for the kernel codes (e.g. SpMV) is sensitive to memory bandwidth: more available bandwidth leads to higher speedups
- Future studies
  - In-depth analysis of memory access patterns (delta, address repetition)
  - Extend the evaluation for other HPC application kernels



Thanks for your attention!



- Acknowledgements
  - Tiago Muck (ARM Ltd.)
  - Stepan Nassyr, Bine Brank (JSC)








### **EPI FUNDING**



This project has received funding from the European High Performance Computing Joint Undertaking (JU) under Framework Partnership Agreement No 800928 and Specific Grant Agreement No 101036168 EPI-SGA2. The JU receives support from the European Union's Horizon 2020 research and innovation programme and from Croatia, France, Germany, Greece, Italy, Netherlands, Portugal, Spain, Sweden, and Switzerland.

