Arm HPC User Group Workshop @ ISC23

An ISC 2023 Workshop

This project is maintained by arm-hpc-user-group

ISC23 AHUG Workshop

The workshop will be held in conjunction with ISC High Performance 2023 (https://www.isc-hpc.com/) in Hamburg, Germany. It is scheduled for 25th May 2023, 2:00pm-6:00pm.

Date and Time

25th May 2023
2:00pm-6:00pm

Place

Hall Y11 - 2nd Floor, Congress Center Hamburg (CCH), Germany

Timetable and agenda

Speaker Institution Title (slides)
Filippo Spiga (AHUG) Welcome and Housekeeping
Simon McIntosh-Smith University of Bristol An update on the GW4 Isambard 3 Arm-based supercomputer
Abstract The GW4 Isambard supercomputer was the first production Arm-based system when it went live in the spring of 2018. Having already gone through two generations of Arm technology, Isambard 3, due to launch at the end of 2023, will be based on NVIDIA's new Grace CPUs. Isambard 3 will deliver 5-6 times the performance of Isambard 2, while using only 20% more power. In this talk we will describe the new system, as well as giving an update on the progress of Isambard's multi-year mission to port and optimise codes to the Arm architecture.
Nam Ho Julich Memory Prefetching Evaluation using gem5 Simulations
Abstract  Significantly increased memory bandwidth is increasingly difficult to exploit in standard multicore CPU architectures. Memory prefetchers play an important role in hiding memory access latencies and ensuring sufficiently high memory-level parallelism. In this talk, we report on ongoing efforts to explore their impact on various HPC benchmarks and mini-apps that implement performance-critical kernels of the lattice Boltzmann method, finite element methods, and reverse time migration. Using modern Arm cores, we explore the performance impact of the different memory prefetcher solutions and configurations that have been implemented in the gem5 simulator.
Carlos Falquez Julich Studying different BFS algorithm implementations with gem5
Abstract  To leverage different CPU features, different implementations of the breadth-first search (BFS) algorithm have been proposed. The gem5 simulator provides the opportunity to investigate how these implementations exploit different CPU configurations. For our study, we assume a modern Arm processor core like Neoverse V1 and report on the impact of different SVE pipelines, cache, network-on-chip, and memory configurations.
Chen Liu Clarkson High Frequency Performance Monitoring via Architectural Event Measurement
Abstract  Obtaining detailed software execution information via performance monitoring counters is a powerful analysis technique. Performance counters provide an effective way to monitor program behavior, so that performance bottlenecks caused by hardware architecture or by software design and implementation can be identified, isolated and improved on. The granularity and overhead of the monitoring mechanism, however, are paramount to proper analysis. Many prior designs provide performance counter monitoring with inherent drawbacks such as intrusive code changes, a slow timer system, or the need for a kernel patch. In this session, we introduce K-LEB (Kernel - Lineage of Event Behavior), a new monitoring mechanism that produces precise, non-intrusive, low-overhead, periodic performance counter data and supports ARM processors. We will discuss the design choices and implementation of the performance counter profiling tool, and how to use performance monitoring counters for low-cost software analysis and its applications.
Luka Stanisic Huawei Performance Evaluation of the Ginkgo Sparse Linear Solver Framework on Arm
Abstract  The Ginkgo linear algebra library provides a set of preconditioners and iterative solvers for sparse systems. Ginkgo receives attention for supporting accelerators, but it also targets CPUs with OpenMP kernels, which are our focus. We characterize the behavior of Ginkgo’s benchmarks (SpMV with 5 formats, matrix conversions, 7 solvers, 9 preconditioners) with respect to hot kernels, top-down analysis, roofline model and working set, along with OpenMP imbalance, pragma use and thread placement on one AArch64 Huawei Kunpeng920 system using GNU GCC and 10 matrices from SuiteSparse. Selected results are complemented with evaluations on AWS Graviton3, 3rd-gen Intel Xeon and AMD EPYC. We offer guidance for optimizing the OpenMP executor by identifying tuning opportunities and pitfalls. The solvers are in the memory-bound region of the roofline model with arithmetic intensity from 0.1 to 3, leading to a FLOPS efficiency below 1%. Scalability is limited by OpenMP imbalance of up to 43% for selected use cases.
Gilles Tourpe AWS Hackathon with TERATEC - Feedback
Abstract  AWS, ARM, UCit and TERATEC organized a hackathon for universities with HPC master's programmes. 10 teams competed on porting a stencil code (contributed by CGG) and code_saturne (contributed by EDF R&D) to AWS Graviton3 instances. This talk reports on the lessons learned. It is supported by AWS and ARM.
Coffee Break
Brendan Bouffler AWS Updates from the field on Graviton 3E for HPC
Abstract  Graviton 3E was engineered specifically for HPC customers, and we've also launched the Hpc7g instance family, based on this processor and coupled with 200 Gb/s Elastic Fabric Adapter networking. We'll explain how this works, how to get access to these instances using HPC tooling, and show the performance results we're seeing, contributed by customers.
Etienne Renault SiPearl Evaluation and performance projections for ARM chips
Abstract  The variety of ARM chips (and SoCs) used in the HPC realm makes it difficult to anticipate performance on other ARM-based architectures. This talk compares the relative performance of ARM chips on different HPC benchmarks and shows some strategies for anticipating results across a variety of configurations.
Filippo Spiga NVIDIA Accelerating time-to-science with the NVIDIA Superchip platform
Abstract The present talk will highlight how the NVIDIA Superchip platform (Grace Superchip and Grace Hopper Superchip), enabled by Arm Neoverse IP, is enabling HPC users and developers to achieve a better time-to-solution and energy-to-solution in their scientific workflows. Performance numbers will be disclosed and discussed, as well as the latest updates on the product go-to-market options.
Beau Paisley Linaro An Analysis of Arm Graviton systems Using Linaro Performance Reports
Abstract  In this presentation we will give an overview of Linaro Performance Reports, a profiler for HPC applications. We will use the tool in a case study that analyzes several configurations of the WRF weather code across different configurations of AWS Graviton systems.
Miwako Tsuji (AHUG) Wrap-up session and announcements

Organizers