2:00pm |
- |
|
Filippo Spiga (AHUG) |
Welcome and Housekeeping |
2:00pm |
20 |
Simon McIntosh-Smith |
University of Bristol |
An update on the GW4 Isambard 3 Arm-based supercomputer
Abstract
The GW4 Isambard supercomputer was the first production Arm-based system when it went live in the spring of 2018. Having already gone through two generations of Arm technology, Isambard 3, due to launch at the end of 2023, will be based on NVIDIA's new Grace CPUs. Isambard 3 will deliver 5-6 times the performance of Isambard 2, while using only 20% more power. In this talk we will describe the new system, as well as giving an update on the progress of Isambard's multi-year mission to port and optimise codes to the Arm architecture.
|
2:20pm |
20 |
Nam Ho |
Julich |
Memory Prefetching Evaluation using gem5 Simulations
Abstract
Significantly increased memory bandwidth is increasingly difficult to exploit in standard multicore CPU architectures. Memory prefetchers play an important role in hiding memory access latencies and ensuring sufficiently high memory-level parallelism. In this talk, we report on ongoing efforts to explore their impact on various HPC benchmarks and mini-apps that implement performance-critical kernels of the Lattice Boltzmann, finite element, and reverse time migration methods. Using modern Arm cores, we explore the performance impact of the different memory prefetcher solutions and configurations that have been implemented in the gem5 simulator.
|
2:40pm |
20 |
Carlos Falquez |
Julich |
Studying different BFS algorithm implementations with gem5
Abstract
To leverage different CPU features, different implementations of the breadth-first search (BFS) algorithm have been proposed. The gem5 simulator provides the opportunity to investigate how these implementations exploit different CPU configurations. For our study, we assume a modern Arm processor core like Neoverse V1 and report on the impact of different SVE pipelines, cache, network-on-chip, and memory configurations.
|
3:00pm |
20 |
Chen Liu |
Clarkson |
High Frequency Performance Monitoring via Architectural Event Measurement
Abstract
Obtaining detailed software execution information via performance monitoring counters is a powerful analysis technique. Performance counters provide an effective method to monitor program behavior; hence performance bottlenecks due to hardware architecture or software design and implementation can be identified, isolated and improved on. The granularity and overhead of the monitoring mechanism, however, are paramount to proper analysis. Many prior designs provide performance counter monitoring but come with inherent drawbacks such as intrusive code changes, a slow timer system, or the need for a kernel patch. In this session, we introduce K-LEB (Kernel - Lineage of Event Behavior), a new monitoring mechanism that produces precise, non-intrusive, low-overhead, periodic performance counter data and supports ARM processors. We will discuss the design choices and implementation of the performance counter profiling tool, and how to use performance monitoring counters for low-cost software analysis and its applications.
|
3:20pm |
20 |
Luka Stanisic |
Huawei |
Performance Evaluation of the Ginkgo Sparse Linear Solver Framework on Arm
Abstract
The Ginkgo linear algebra library provides a set of preconditioners and iterative solvers for sparse systems. Ginkgo receives attention for supporting accelerators, but also targets CPUs with OpenMP kernels, which we focus on. We characterize the behavior of Ginkgo’s benchmarks (SpMV with 5 formats, matrix conversions, 7 solvers, 9 preconditioners) with respect to hot kernels, top-down analysis, roofline model and working set, along with OpenMP imbalance, pragma use and thread placement on one AArch64 Huawei Kunpeng 920 system using GNU GCC and 10 matrices from SuiteSparse. Selected results are complemented with evaluations on AWS Graviton3, 3rd-gen Intel Xeon and AMD EPYC. We offer guidance for optimizing the OpenMP executor by identifying tuning opportunities and pitfalls. The solvers are in the memory-bound region of the roofline model with arithmetic intensity from 0.1 to 3, leading to a FLOPS efficiency below 1%. Scalability is limited by OpenMP imbalance of up to 43% for selected use cases.
|
3:40pm |
20 |
Gilles Tourpe |
AWS |
Hackathon with TERATEC - Feedback
Abstract
AWS, ARM, UCit and TERATEC organized a hackathon for university HPC master's programs. 10 teams competed on porting a stencil code (contributed by CGG) and code_saturne (contributed by EDF R&D) to AWS Graviton3 instances. This talk reports on the lessons learned. This talk will be supported by AWS and ARM.
|
4:00pm |
30 |
Coffee Break |
4:30pm |
20 |
Brendan Bouffler |
AWS |
Updates from the field on Graviton 3E for HPC
Abstract
Graviton 3E was engineered specifically for HPC customers, and we've also launched the Hpc7g instance family, based on this processor and coupled with 200 Gb/s of Elastic Fabric Adapter networking. We'll explain how this works, how to get access to these instances using HPC tooling, and show the performance results we're seeing, contributed by customers.
|
4:50pm |
20 |
Etienne Renault |
SiPearl |
Evaluation and performance projections for ARM chips
Abstract
The variety of ARM chips (and SoCs) used in the HPC realm makes it difficult to anticipate performance on other ARM-based architectures. This talk compares the relative performance of ARM chips on different HPC benchmarks and shows some strategies for anticipating results across a variety of configurations.
|
5:10pm |
20 |
Filippo Spiga |
NVIDIA |
Accelerating time-to-science with the NVIDIA Superchip platform
Abstract
This talk will highlight how the NVIDIA Superchip platform (Grace Superchip and Grace Hopper Superchip), based on Arm Neoverse IP, enables HPC users and developers to achieve better time-to-solution and energy-to-solution in their scientific workflows. Performance numbers will be disclosed and discussed, as well as the latest updates on the product go-to-market options.
|
5:30pm |
20 |
Beau Paisley |
Linaro |
An Analysis of Arm Graviton systems Using Linaro Performance Reports
Abstract
In this presentation we will give an overview of Linaro Performance Reports, an application profiler for HPC applications. We will use the tool in a case study analyzing several configurations of the WRF weather code on different configurations of AWS Graviton systems.
|
5:50pm |
10 |
|
Miwako Tsuji (AHUG) |
Wrap-up session and announcements |