Arm HPC User Group Workshop @ ISC25

The 2025 Arm HPC User Group (AHUG) Workshop is held in conjunction with ISC High Performance 2025 in Hamburg, Germany.

Date & Time: June 13th, 2025 @ 9:00am - 1:00pm
Location: Hall X5 - 1st floor, Congress Center Hamburg (CCH), Germany

Join the AHUG Slack channel!

Timetable and Agenda

Time	Duration	Title	Speaker
09:00-09:05	5m	Welcome & Plenary address	Filippo Spiga (NVIDIA / AHUG)
09:05-09:45	40m	Invited Talk Fugaku-LLM: A Large Language Model Trained on the Supercomputer Fugaku – (SLIDES)	Koichi Shirahata (Fujitsu)
09:45-10:10	25m	An Overview of the Maturity of SYCL Implementations and Backends for AArch64 – (SLIDES)	Etienne Renault (SiPearl)
10:10-10:35	25m	Early results from Isambard 3, one of the first NVIDIA Grace CPU-based systems – (SLIDES)	Tom Green (Univ. Bristol, BriCS), Simon McIntosh-Smith (Univ. Bristol, BriCS)
10:35-11:00	25m	Porting and tuning GROMACS on Arm SVE – (SLIDES)	Gilles Gouaillardet (RIST)
11:00-11:30	30m	Coffee break
11:30-11:55	25m	MareNostrum5’s Graceful landing – (SLIDES)	Majesa Trimmel (BSC), Fabio Banchelli (BSC)
11:55-12:20	25m	Advancing the ARM Ecosystem for European Scientific Flagship Codes – (SLIDES)	Erwan Raffin (EVIDEN)
12:20-12:45	25m	Exploring compiler behavior on an industrial application (OpenRadioss) on modern arm processors – (SLIDES)	Hugo Bolloré (UVSQ)
12:45-13:00	15m	Closing Remarks	Conrad Hillairet (Arm Ltd), Filippo Spiga (NVIDIA / AHUG)

Abstracts

(Invited Talk) Fugaku-LLM: A Large Language Model Trained on the Supercomputer Fugaku

Speaker: Koichi Shirahata (Fujitsu)
While not initially designed for large-scale deep learning models like Large Language Models (LLMs), Japan’s flagship supercomputer, Fugaku, provided a unique opportunity. This work details the optimization of deep learning frameworks for distributed parallel execution on Fugaku, creating a high-performance computing environment for LLM training. A novel LLM was trained from scratch using a large dataset primarily focused on Japanese text.

An Overview of the Maturity of SYCL Implementations and Backends for AArch64

Speaker: Etienne Renault (SiPearl)
Modern HPC systems are heterogeneous at both the processing-unit level and the memory level. This heterogeneity can be leveraged using SYCL, a programming model that abstracts the devices to which part of the computation is offloaded. With increasing performance, Arm cores are today a suitable option for offloading. This talk will focus on the AArch64 SYCL ecosystem and its implementations, each supporting a variety of backends. In this context, application developers need to carefully consider each combination of [framework x backend x device] in order to get the most out of their code. The talk examines the performance of two SYCL implementations (DPC++ and AdaptiveCpp) for AArch64 and evaluates them against HeCBench. The aim is to help developers quickly pick the correct backend for their needs. A peek at our latest advances to strengthen the SYCL ecosystem will also be presented, including the integration of NUMA awareness and Outer-Loop Vectorization.

Early results from Isambard 3, one of the first NVIDIA Grace CPU-based systems

Speakers: Tom Green, Prof. Simon McIntosh-Smith (University of Bristol, BriCS)
The University of Bristol has been running production Arm-based HPC services since commissioning the Isambard 1 supercomputer back in 2018. This first system was based on Marvell ThunderX2 Arm-based CPUs, and the follow-on, Isambard 2, continued the theme, adding Fujitsu A64FX processors. Isambard 3 has recently gone into production, delivering one of the first substantial NVIDIA Grace-Grace CPU systems. Isambard 3’s nodes each deliver 144 fast Arm Neoverse V2 cores and around 1 TB/s of memory bandwidth. The system is delivered by HPE, and exploits their Slingshot 11 200 Gbps network to connect the 384 nodes in the system, for a total of 55,296 cores. In this talk we will present early results from real users running a wide range of scientific applications on the system, as well as summarising useful lessons learned so far.

Porting and tuning GROMACS on Arm SVE

Speaker: Gilles Gouaillardet (RIST)
This short talk will explain how SVE support was added to GROMACS, the challenges that were met and how they were overcome, and describe the optimizations that were implemented in order to improve performance on A64FX.

MareNostrum5’s Graceful landing

Speakers: Fabio Banchelli, Majesa Trimmel (BSC)
In this talk, we introduce the Next Generation General Purpose Partition of MareNostrum 5, deployed at the Barcelona Supercomputing Center. It comprises 400 nodes based on the NVIDIA Grace CPU Superchip. Our goal is to share our experience working on a full-scale Grace-based system. We present results from simple synthetic micro-benchmarks to production-ready scientific codes. Micro-benchmarks allow us to evaluate specific hardware features, while HPC applications allow us to conduct experiments across multiple nodes and analyze the system’s scalability. We also leverage the power monitoring infrastructure deployed at BSC, which has been extended to support the power telemetry tools provided by the Grace CPU. Lastly, we share some insights on the system software, such as the multiple compiler toolchains and math libraries to choose from.

Advancing the ARM Ecosystem for European Scientific Flagship Codes

Speaker: Erwan Raffin (EVIDEN)
The EMOPASS project aims at improving the Arm ecosystem for HPC. This includes the analysis of flagship HPC codes on the latest Arm HPC processors using a leading-edge profiling tool, MAQAO. This tool is being used to study, among key scientific applications, flagship codes from several European CoEs, leading to feedback and further improvements to the applications, compilers and to MAQAO itself. We will present several studies made in the context of the project, where these CoE applications and kernels were used to evaluate compilers and micro-architecture capabilities. Moreover, we will look at the positive feedback loop this helped create for end-users of Arm CPUs. This includes new features in MAQAO such as thread binding and activity, and improvements to LLVM such as improved vectorization and Fortran support in the new flang front-end.

Exploring compiler behavior on an industrial application (OpenRadioss) on modern arm processors

Speaker: Hugo Bolloré (University of Versailles Saint-Quentin, UVSQ)
On modern processors, the compiler's task is very difficult (it must deal with large Instruction Level Parallelism and the various vector lengths and instruction sets available), and compilers often generate suboptimal code. In this presentation, we will demonstrate how MAQAO (www.maqao.org) can be used to explore and detect compiler mistakes or suboptimal decisions. This exploration will rely on MAQAO's advanced binary analysis features combined with simplified performance models (Code Quality Analysis). The work will be carried out with two compilers, ACFL and GFortran, using different compiler flags on OpenRadioss, a well-known open-source crash & impact code. Target hardware will be AWS Graviton3 and AWS Graviton4 processors. A large effort will be devoted to comparative analysis across different compilers, compiler options and hardware.