



# Impact of Write-Allocate Elimination on Fujitsu A64FX

#### <u>Yan Kang</u>

The Pennsylvania State University, Pennsylvania, USA

Mahmut Kandemir

The Pennsylvania State University, Pennsylvania, USA

Sayan Ghosh

Pacific Northwest National Laboratory, Washington, USA

Andrés Marquez

Pacific Northwest National Laboratory, Washington, USA









#### Write-Allocate Avoidance


**Evasion:** [Hardware detects that a cache line is going to be overwritten] The cache line is stored directly to memory (Intel: non-temporal stores, compiler hints, or automatic SpecI2M).

**Elimination:** [Hardware detects that a cache line is going to be overwritten] An L2 cache line is written directly with zeroes; the processor then loads that cache line, avoiding the memory read.

**Fujitsu A64FX:** Elimination is available through a special instruction of the 64-bit Armv8-A instruction set (DC ZVA).

Can write-allocate elimination via "zero fill" improve the performance of various applications on Fujitsu A64FX?
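For readers who have not used the instruction, here is a minimal sketch of how DC ZVA can be issued from C with GCC-style inline assembly. The helper names `zva_block_bytes` and `zero_block` are hypothetical and not from the talk. The sketch assumes the block size is read from the `DCZID_EL0` register, and it falls back to `memset` on targets where DC ZVA is unavailable, which gives the same result without any traffic saving.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bytes zeroed by one DC ZVA, or 0 if the instruction is unavailable.
 * DCZID_EL0: BS (bits 3:0) is log2 of the block size in 4-byte words,
 * DZP (bit 4) set means DC ZVA is prohibited. */
static inline size_t zva_block_bytes(void)
{
#if defined(__aarch64__)
    uint64_t dczid;
    __asm__ volatile("mrs %0, dczid_el0" : "=r"(dczid));
    if (dczid & (1u << 4))
        return 0;
    return (size_t)4 << (dczid & 0xF);
#else
    return 0; /* not AArch64: no DC ZVA */
#endif
}

/* Zero the `block` bytes starting at `p`; `p` must be aligned to `block`. */
static inline void zero_block(void *p, size_t block)
{
    assert(block > 0 && ((uintptr_t)p % block) == 0);
#if defined(__aarch64__)
    if (zva_block_bytes() == block) {
        /* Claims the line and fills it with zeroes without reading memory. */
        __asm__ volatile("dc zva, %0" : : "r"(p) : "memory");
        return;
    }
#endif
    memset(p, 0, block); /* portable fallback, assumption: same semantics */
}
```

On A64FX a call such as `zero_block(line_ptr, 256)` would be expected to take the DC ZVA path, since the cache line there is 256 bytes. On other machines the same call simply zeroes the buffer.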



Read Dr. Georg Hager's blog post and paper: https://blogs.fau.de/hager/archives/8997 https://onlinelibrary.wiley.com/doi/10.1002/cpe.6512





#### Zero Filling in Fujitsu A64FX Without ZFILL



This is the memory access we are trying to eliminate!

**"Zero fill" on L2 cache:** Upon receiving the DC ZVA request, the L2 cache secures the cache line corresponding to the specified virtual address and writes zero data to it.

**"Zero fill" on L1 cache:** Zero data is written after the data in the L1 cache has been written back to the L2 cache.















#### Benchmarking decisions





- STREAM does not represent irregular cases, which make up most applications
- Graphs: irregular memory accesses
  - Applications perform repetitive *neighborhood accesses*
- NEVE is a benchmark like STREAM, but for graphs (it has COPY, SUM, and MAX kernels); metric: |V|\*|E|\*#ops / t

| STREAM name | STREAM kernel | Graph neighborhood access kernel name | Graph neighborhood access kernel |
|-------------|---------------|---------------------------------------|----------------------------------|
| COPY | $a(i) = b(i)$ | Neighbor Copy | $a(i,j) \leftarrow \mathrm{COPY}(\forall_{i \in V}\, \forall_{j \in neighbor(i)}\; weight(j))$ |
| SCALE | $a(i) = q \cdot b(i)$ | Neighbor Add | $b(i) \leftarrow \mathrm{SUM}(\forall_{i \in V}\, \forall_{j \in neighbor(i)}\; weight(j))$ |
| SUM | $a(i) = b(i) + c(i)$ | Neighbor Max | $c(i) \leftarrow \mathrm{MAX}(\forall_{i \in V}\, \forall_{j \in neighbor(i)}\; weight(j))$ |
| TRIAD | $a(i) = b(i) + q \cdot c(i)$ | | |
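The graph neighborhood kernels in the table can be sketched over a CSR graph as follows. This is only an illustration: the `csr_graph` and `edge_t` types and the function names are hypothetical, and the empty-neighborhood conventions (0 for the sum, negative infinity for the max) are assumptions, not NEVE's definitions.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

typedef struct { int tail; double weight; } edge_t; /* hypothetical edge record */

typedef struct {
    int nv;               /* number of vertices |V| */
    const int *rowptr;    /* nv + 1 offsets into colidx */
    const edge_t *colidx; /* edges grouped by source vertex */
} csr_graph;

/* Neighbor Copy: one output element per edge (out has |E| entries). */
static inline void neighbor_copy(const csr_graph *g, double *out)
{
    for (int i = 0; i < g->nv; ++i)
        for (int e = g->rowptr[i]; e < g->rowptr[i + 1]; ++e)
            out[e] = g->colidx[e].weight;
}

/* Neighbor Add: out[i] is the sum of the weights in the neighborhood of i. */
static inline void neighbor_add(const csr_graph *g, double *out)
{
    for (int i = 0; i < g->nv; ++i) {
        double sum = 0.0;
        for (int e = g->rowptr[i]; e < g->rowptr[i + 1]; ++e)
            sum += g->colidx[e].weight;
        out[i] = sum;
    }
}

/* Neighbor Max: out[i] is the largest weight in the neighborhood of i. */
static inline void neighbor_max(const csr_graph *g, double *out)
{
    for (int i = 0; i < g->nv; ++i) {
        double best = -INFINITY; /* assumption: value for an empty neighborhood */
        for (int e = g->rowptr[i]; e < g->rowptr[i + 1]; ++e)
            if (g->colidx[e].weight > best)
                best = g->colidx[e].weight;
        out[i] = best;
    }
}
```

Unlike STREAM, the inner trip count depends on each vertex's degree, so the amount of work between consecutive stores to the output array varies from vertex to vertex.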





Graph CSR access







Can return MB/s!

Sayan Ghosh, Nathan R. Tallent, and Mahantesh Halappanavar. "Characterizing performance of graph neighborhood communication patterns." *IEEE Transactions on Parallel and Distributed Systems* 33.4 (2021): 915-928.





PennState

High Performance Computing Lab

### Explicit "Zero Fill" formulation for graph neighborhood accesses










"Zero filling" may not yield performance benefits for irregular graph workloads unless the number of vertices in a graph is significantly larger than the cache line size and the standard deviation of the vertex degrees is relatively low.


College of Engineering

<pre>
// Excerpt: the opening of the OpenMP region and of the loop over vertices is not shown.
double sum = 0.0;
// Invoke zero fill in strides larger than the L2 prefetch distance.
if (jbuf + OFFSET &lt; zfill_limit)
    zfill(jbuf + OFFSET);
for (int i = 0; i &lt; elems_cache_line; ++i) {
    // Inner loop, where the zfill virtual address will be invoked
    // several times (trip count unknown).
    for (int e = jrowptr[i]; e &lt; jrowptr[i+1]; ++e) {
        sum += colidx[e].weight;
    }
    jbuf[i] = sum;
}
} // loop over vertices
} // openmp
</pre>
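The excerpt above can be fleshed out into a self-contained, serial sketch. It is not the authors' implementation: `neighbor_add_zfill`, `ZFILL_DISTANCE`, and the `edge_t` layout are hypothetical, the zero-fill distance of 4 lines is an arbitrary placeholder, and the per-vertex reset of `sum` is an assumption of this sketch. Zero fill is applied only to whole lines that lie inside the output buffer and have not been written yet, and it is skipped entirely when DC ZVA is unavailable or the buffer is not line-aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { int tail; double weight; } edge_t; /* hypothetical edge record */

#define ZFILL_DISTANCE 4 /* hypothetical: lines to run ahead of the current block */

/* Bytes zeroed by one DC ZVA, or 0 when unavailable (read from DCZID_EL0). */
static inline size_t zva_block_bytes(void)
{
#if defined(__aarch64__)
    uint64_t dczid;
    __asm__ volatile("mrs %0, dczid_el0" : "=r"(dczid));
    return (dczid & (1u << 4)) ? 0 : (size_t)4 << (dczid & 0xF);
#else
    return 0;
#endif
}

static inline void zfill(void *line)
{
#if defined(__aarch64__)
    __asm__ volatile("dc zva, %0" : : "r"(line) : "memory");
#else
    (void)line; /* no-op elsewhere: every element is stored later anyway */
#endif
}

/* Neighbor Add over CSR, processing the output in cache-line-sized blocks. */
static inline void neighbor_add_zfill(int nv, const int *rowptr,
                                      const edge_t *colidx, double *jbuf)
{
    size_t line = zva_block_bytes();
    int use_zfill = line >= sizeof(double) && ((uintptr_t)jbuf % line) == 0;
    /* Fallback block of 32 doubles matches a 256-byte line (A64FX). */
    size_t elems = use_zfill ? line / sizeof(double) : 32;
    size_t whole = ((size_t)nv / elems) * elems; /* elements in whole lines */

    for (size_t base = 0; base < (size_t)nv; base += elems) {
        size_t ahead = base + ZFILL_DISTANCE * elems;
        if (use_zfill && ahead + elems <= whole)
            zfill(jbuf + ahead); /* line not yet written, fully inside jbuf */

        size_t end = base + elems < (size_t)nv ? base + elems : (size_t)nv;
        for (size_t i = base; i < end; ++i) {
            double sum = 0.0;
            for (int e = rowptr[i]; e < rowptr[i + 1]; ++e)
                sum += colidx[e].weight;
            jbuf[i] = sum;
        }
    }
}
```

A parallel version would additionally need each thread's chunk of `jbuf` to start and end on a line boundary, so that one thread never zero-fills a line another thread has already written.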






#### Benchmarks and applications for evaluations

| Benchmark Scenarios               | Tested Kernels          |
|-----------------------------------|-------------------------|
| STREAM                            | Copy, Scale, Add, Triad |
| Graph Neighborhood Kernels (NEVE) | Add, Copy, Max          |

Expecting STREAM to be the best case!
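A back-of-the-envelope traffic count suggests why STREAM is expected to benefit most. The sketch below assumes 8-byte elements, purely memory-bound kernels, and that every store stream costs one extra read stream when write-allocate is active. The type and function names are hypothetical, and the resulting ratios are upper bounds under those assumptions, not measured results.

```c
#include <assert.h>

typedef struct { const char *name; int loads; int stores; } kernel_t;

/* Load and store streams per iteration of each STREAM kernel. */
static const kernel_t stream_kernels[] = {
    {"COPY",  1, 1},  /* a(i) = b(i)          */
    {"SCALE", 1, 1},  /* a(i) = q*b(i)        */
    {"ADD",   2, 1},  /* a(i) = b(i) + c(i)   */
    {"TRIAD", 2, 1},  /* a(i) = b(i) + q*c(i) */
};

/* Memory traffic in bytes per loop iteration. */
static inline int bytes_per_iter(kernel_t k, int write_allocate)
{
    int streams = k.loads + k.stores + (write_allocate ? k.stores : 0);
    return streams * (int)sizeof(double);
}

/* Upper bound on the speedup from eliminating the write-allocate read. */
static inline double max_speedup(kernel_t k)
{
    return (double)bytes_per_iter(k, 1) / bytes_per_iter(k, 0);
}
```

Under these assumptions COPY and SCALE move 24 bytes per iteration with write-allocate and 16 without (a bound of 1.5x), while ADD and TRIAD move 32 versus 24 bytes (a bound of about 1.33x).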









#### STREAM benchmark evaluations (GCC, ARM and FCC)









#### Loop versioning for STREAM



Enabling the loop versioning option allows the compiler to perform software pipelining optimizations, bringing overall FCC performance close to that of ARM/GCC.







#### Graph benchmark evaluations (GCC, ARM and FCC)



- Different input graphs imply different structure and work per loop
- ZFILL: up to 6% degradation, but also up to 64% improvement (FCC)
- Since the "zero fill" stride length can be greater than the median number of edges for certain graphs, it can have a limited impact and, in some cases, may incur overheads
- Automatic DC ZVA generation by the compiler is not guaranteed







#### Graph application evaluations



- ZFILL does not improve performance when there is limited work in the ZFILL section
- ~15% improvement when there is sufficient work
- Automatic DC ZVA generation by the compiler is not guaranteed



Non-temporal store patterns may not positively impact overall application execution times if they are not on the critical performance path.







#### Observations

- NEVE exhibits about 2–5x lower performance than STREAM
- End-to-end improvements of 5–20% across benchmarks and diverse application scenarios due to "zero fill" adaptations
- Performance improvements of up to 32% in Louvain clustering and median improvements of 5–17% in GAP PR, CC, and BC benchmarks









#### Limitations of zero filling

- The number of vertices in a graph may not be larger than the cache line size.
- Regular prefetching may be more beneficial for streaming writes to short buffers.
- Non-temporal stores may not be on the critical performance path.
- Applications may not be written to exploit non-temporal stores.







#### Summary & Future Steps

Where to apply?

Demonstrated the impact of write-allocate elimination on A64FX for various applications and showcased the improvement brought by "zero fill".

Identified possible causes of the varied performance improvements we observed, such as software pipelining optimizations, SIMD extensions, etc.

What exactly are the reasons behind the performance variations of each (or all) applications on A64FX?

Under what conditions does "zero fill" maximize the improvement?

Under what conditions does it require minimal modifications?

Automatic DC ZVA generation by the compiler?





#### Acknowledgements

- PNNL LDRD Data Model-Convergence (PI: Sayan Ghosh, PNNL)
- DOE ASCR Advanced Memory to Support Artificial Intelligence for Science (AIAMS, PI: Andrés Márquez, PNNL)
- Penn State HPCL (Prof. Mahmut Kandemir)
- Ookami testbed support (Dr. Eva Siegmann and team, SBU)
- IWAHPCE'24 paper reviewers



