ISC23 AHUG Workshop

to be held in conjunction with ISC High Performance 2023 (https://www.isc-hpc.com/) in Hamburg, Germany. The AHUG workshop is scheduled on 25th May, 2:00pm-6:00pm.

Date and Time

25th May 2023
2:00pm-6:00pm

Place

Hall Y11 - 2nd Floor, Congress Center Hamburg (CCH), Germany

Timetable and agenda

Speaker	Institution	Title (slides)
	Filippo Spiga (AHUG)	Welcome and Housekeeping
Simon McIntosh-Smith	University of Bristol	An update on the GW4 Isambard 3 Arm-based supercomputer Abstract The GW4 Isambard supercomputer was the first production Arm-based system when it went live in the spring of 2018. Having already gone through two generations of Arm technology, Isambard 3, due to launch at the end of 2023, will be based on NVIDIA's new Grace CPUs. Isambard 3 will deliver 5-6 times the performance of Isambard 2, while using only 20% more power. In this talk we will describe the new system, as well as giving an update on the progress of Isambard's multi-year mission to port and optimise codes to the Arm architecture.
Nam Ho	Julich	Memory Prefetching Evaluation using gem5 Simulations Abstract Significantly increased memory bandwidth is increasingly difficult to exploit in standard multicore CPU architectures. Memory prefetchers play an important role in hiding memory access latencies and ensuring sufficiently high memory-level parallelism. In this talk, we report on ongoing efforts to explore their impact on various HPC benchmarks and mini-apps that implement performance-critical kernels of Lattice Boltzmann Method, finite element, and reverse time migration methods. Using modern Arm cores, we explore the performance impact of different memory prefetcher solutions and configurations that have been implemented in the gem5 simulator.
Carlos Falquez	Julich	Studying different BFS algorithm implementations with gem5 Abstract 　To leverage different CPU features, different implementations of the breadth-first search (BFS) algorithm have been proposed. The gem5 simulator provides the opportunity to investigate how these implementations exploit different CPU configurations. For our study, we assume a modern Arm processor core like Neoverse V1 and report on the impact of different SVE pipelines, cache, network-on-chip, and memory configurations.
Chen Liu	Clarkson	High Frequency Performance Monitoring via Architectural Event Measurement Abstract Obtaining detailed software execution information via performance monitoring counters is a powerful analysis technique. Performance counters provide an effective method to monitor program behaviors; hence performance bottlenecks due to hardware architecture or software design and implementation can be identified, isolated and improved on. The granularity and overhead of the monitoring mechanism, however, are paramount to proper analysis. Many prior designs have been able to provide performance counter monitoring, but with inherent drawbacks such as intrusive code changes, a slow timer system, or the need for a kernel patch. In this session, we introduce K-LEB (Kernel - Lineage of Event Behavior), a new monitoring mechanism that produces precise, non-intrusive, low-overhead, periodic performance counter data and supports ARM processors. In this talk, we will discuss the design choices and implementation of the performance counter profiling tool, and how to use performance monitoring counters for low-cost software analysis and its applications.
Luka Stanisic	Huawei	Performance Evaluation of the Ginkgo Sparse Linear Solver Framework on Arm Abstract The Ginkgo linear algebra library provides a set of preconditioners and iterative solvers for sparse systems. Ginkgo receives attention for supporting accelerators, but also targets CPUs with OpenMP kernels that we focus on. We characterize the behavior of Ginkgo’s benchmarks (SpMV with 5 formats, matrix conversions, 7 solvers, 9 preconditioners) with respect to hot kernels, top-down analysis, roofline model and working set, along with OpenMP imbalance, pragma use and thread placement on one AArch64 Huawei Kunpeng920 system using GNU GCC and 10 matrices from SuiteSparse. Selected results are complemented with evaluations on AWS Graviton3, 3rd-gen Intel Xeon and AMD EPYC. We offer guidance for optimizing the OpenMP executor by identifying tuning opportunities and pitfalls. The solvers are in the memory-bound region of the roofline model with arithmetic intensity from 0.1 to 3, leading to a FLOPS efficiency below 1%. Scalability is limited by OpenMP imbalance of up to 43% for selected use cases.
Gilles Tourpe	AWS	Hackathon with TERATEC - Feedback Abstract AWS, ARM, UCit and TERATEC organized a hackathon for university HPC master's programs. 10 teams competed on porting a stencil code (contributed by CGG) and code_saturne (contributed by EDF R&D) to AWS Graviton3 instances. This talk reports on the lessons learned. This talk will be supported by AWS and ARM.
Coffee Break
Brendan Bouffler	AWS	Updates from the field on Graviton 3E for HPC Abstract 　Graviton 3E was engineered specifically for HPC customers and we've also launched an Hpc7g instance family, based on this processor, coupled with 200 Gb/s of Elastic Fabric Adapter. We'll explain how this works, how to get access to these using HPC tooling, and show the performance results we're seeing - contributed by customers.
Etienne Renault	SiPearl	Evaluation and performance projections for ARM chips Abstract The variety of ARM chips (and SoCs) used in the HPC realm makes it difficult to anticipate performance on other ARM-based architectures. This talk compares the relative performance of ARM chips on different HPC benchmarks and shows some strategies to anticipate results across a variety of configurations.
Filippo Spiga	NVIDIA	Accelerating time-to-science with the NVIDIA Superchip platform Abstract The present talk will highlight how the NVIDIA Superchip platform (Grace Superchip and Grace Hopper Superchip), enabled by Arm Neoverse IP, is enabling HPC users and developers to achieve a better time-to-solution and energy-to-solution in their scientific workflows. Performance numbers will be disclosed and discussed, as well as the latest updates on the product go-to-market options.
Beau Paisley	Linaro	An Analysis of Arm Graviton systems Using Linaro Performance Reports Abstract 　In this presentation we will give an overview of Linaro Performance Reports, an application profiler for HPC applications. We will use the tool to create a case study for analyzing various configurations of the WRF weather code on various configurations of AWS Graviton systems.
	Miwako Tsuji (AHUG)	Wrap-up session and announcements

Organizers

Filippo Spiga (NVIDIA)
Simon McIntosh-Smith (University of Bristol)
Eva Siegmann (Stony Brook University)
Miwako Tsuji (RIKEN)