Arm HPC User Group Workshop @ ISC24

An ISC 2024 Workshop

The 2024 Arm HPC User Group (AHUG) Workshop is held in conjunction with ISC High Performance 2024 in Hamburg, Germany.

Date & Time: May 16th, 2024 @ 9:00am - 1:00pm
Location: Hall Y10 - 2nd floor, Congress Center Hamburg (CCH), Germany

Join the AHUG Slack channel!

Timetable and Agenda

Time         Duration  Title                                                                              Speaker               Affiliation
09:00-09:10  10m       Welcome Remarks & Plenary Address to the AHUG Community                            Filippo Spiga         AHUG
09:10-09:45  35m       Invited Talk: Isambard-3 and Isambard-AI                                           Simon McIntosh-Smith  University of Bristol / GW4
09:45-10:10  25m       Performance analyses of benchmark applications on different A64FX architectures    Seydou Ba             RIKEN R-CCS
10:10-10:35  25m       NVIDIA Grace Superchip Early Evaluation for HPC Applications                       Fabio Banchelli       BSC
10:35-11:00  25m       Accelerating Hierarchical Collective Communication on next-gen ARM architectures   Alon Zameret          Toga Networks - Huawei
11:00-11:30  30m       Coffee break
11:30-11:55  25m       Running Arm Accelerated Solutions for Engineering Workflows in the Cloud with Rescale  Sam Zakrzewski    Rescale
11:55-12:20  25m       Extending Arm’s Reach by Going EESSI                                               Kenneth Hoste         Ghent University
12:20-12:40  20m       NVIDIA Grace Superchip                                                             Filippo Spiga         NVIDIA
12:40-13:00  20m       EDF R&D Code_Saturne performance on AWS HPC7g instance                             Conrad Hillairet      Arm Ltd & AWS


Performance analyses of benchmark applications on different A64FX architectures

Speaker: Seydou Ba (RIKEN R-CCS)
Modern supercomputers are increasingly complex, with faster processors and denser core counts per node. This work compares the performance of different node architectures of Fujitsu’s A64FX, namely the FX1000 (Fugaku) and two versions of the FX700. The architectures differ mainly in that the FX1000 has 48 cores per node and uses the TofuD interconnect, while the FX700s use InfiniBand, with one version offering 48 cores per node and the other 24 cores per node at a higher frequency. We monitor performance with profilers and analysis tools to conduct detailed performance studies of key benchmark applications. Furthermore, we aim to expand this study to analyze hardware options for interconnects, leaning toward photonics designs for next-generation supercomputers.

NVIDIA Grace Superchip Early Evaluation for HPC Applications

Speaker: Fabio Banchelli (BSC)
Arm-based systems have been a reality in HPC for more than a decade. However, a new chip entering the market always brings challenges, not only at the ISA level, but also with regard to SoC integration, the memory subsystem, board integration, node interconnection, and finally the OS and all layers of the system software (compilers and libraries). Guided by the procurement of an NVIDIA Grace HPC cluster within the deployment of MareNostrum 5, and emulating the approach of a scientist who needs to migrate their scientific research to a new HPC system, we evaluated five complex scientific applications on engineering-sample nodes of the NVIDIA Grace CPU Superchip and the NVIDIA Grace Hopper Superchip (CPU-only). We report intra-node and inter-node scalability and early performance results showing a speed-up between 1.3x and 4.28x for all codes when compared to the current generation of MareNostrum 4, powered by Intel Skylake CPUs.

Accelerating Hierarchical Collective Communication on next-gen ARM architectures

Speaker: Alon Zameret (Toga Networks - Huawei)
In large-scale distributed computing, collective communication often poses a bottleneck due to the latency and bandwidth limitations of modern networks. In this work, we propose an innovative hierarchical approach to accelerate collective communication, based on multiple levels of communication aggregation (intra-node, inter-node, inter-rack). Assigning multiple representatives at each level enables communication and data partitioning that alleviate latency and bandwidth bounds in HPC and AI applications. Leveraging the SIMD extensions (e.g. SVE) offered in next-generation ARM architectures further improves performance by optimizing memory copy and reduction operations. The resulting communication component, MLMR, demonstrates up to a 5x reduction in AllReduce collective communication time on an ARM cluster with more than 12k cores. Further work is underway to introduce payload pipelining in order to overlap inter-level communication and further enhance performance.
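The multi-level aggregation idea described above can be sketched as a minimal, illustrative simulation — this is not the MLMR implementation, and the grouping of ranks into nodes and racks is an assumption chosen purely for illustration. Each level produces one partial result per group (node, then rack), so only representatives communicate across the slower levels of the network before the result is broadcast back down.

```python
# Illustrative simulation of a hierarchical (multi-level) AllReduce-by-sum:
# values are first reduced within each node, then across nodes within a rack,
# then across racks, and the global result is broadcast back to every rank.
# In a real system, each level maps onto progressively slower network links.

def hierarchical_allreduce(values, ranks_per_node, nodes_per_rack):
    # Level 1: intra-node reduction -- one partial sum per node.
    node_sums = [sum(values[i:i + ranks_per_node])
                 for i in range(0, len(values), ranks_per_node)]
    # Level 2: inter-node (intra-rack) reduction -- one partial sum per rack.
    rack_sums = [sum(node_sums[i:i + nodes_per_rack])
                 for i in range(0, len(node_sums), nodes_per_rack)]
    # Level 3: inter-rack reduction -- the global result.
    total = sum(rack_sums)
    # Broadcast: every rank ends up with the same reduced value.
    return [total] * len(values)

if __name__ == "__main__":
    ranks = list(range(16))  # 16 ranks: 4 per node, 2 nodes per rack
    result = hierarchical_allreduce(ranks, ranks_per_node=4, nodes_per_rack=2)
    assert result == [sum(ranks)] * 16  # matches a flat AllReduce
```

The sketch only captures the data flow of the hierarchy; the performance benefits reported in the talk come from restricting inter-node and inter-rack traffic to a few representatives per group and from SIMD-optimized (e.g. SVE) local reductions.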

Running Arm Accelerated Solutions for Engineering Workflows in the Cloud with Rescale

Speaker: Sam Zakrzewski (Rescale)
As computational demands in engineering continue to rise, leveraging cloud computing resources becomes increasingly imperative. This presentation delves into the optimization and efficiency gains achieved by running engineering workflows in the cloud on Arm-based hardware through Rescale’s platform. Case studies and demonstrations illustrate how Rescale’s platform enables seamless migration of engineering workflows to Arm-based cloud instances, unlocking strong performance and cost-effectiveness. Key topics covered include the technical considerations of transitioning to the Arm architecture, performance benchmarks, and cost comparisons illustrating the economic benefits of cloud-based Arm computing. Attendees will gain a comprehensive understanding of how adopting Arm accelerated solutions for engineering workflows on Rescale’s platform empowers organizations to tackle complex simulations with unprecedented efficiency, driving innovation and competitiveness.

Extending Arm’s Reach by Going EESSI

Speaker: Kenneth Hoste (Ghent University)
In the European Environment for Scientific Software Installation (EESSI) community project, we provide a stack of optimized scientific software installations that work on any Linux system, regardless of whether it is powered by Intel, AMD, or Arm CPUs (soon also RISC-V). This effort is currently funded through the EuroHPC Centre-of-Excellence MultiXscale. In this talk, we will share our experiences with building a wide range of scientific applications, libraries, and required dependencies for different Arm microarchitectures. We encountered (and fixed) various problems along the way, especially when targeting Arm Neoverse V1 and when running software test suites. Additionally, we will demonstrate how you can get access in a matter of minutes to a rich set of optimized software installations for Arm systems, including Raspberry Pis, cloud instances powered by Arm CPUs, and Arm-based EuroHPC supercomputers like Deucalion and JUPITER.