International Workshop on Arm-based HPC: Practice and Experience (IWAHPCE-2025)

 
To be held in conjunction with the International Conference on High Performance Computing in Asia-Pacific Region (HPC Asia 2025), Hsinchu, Taiwan, February 19-21, 2025.
 
Workshop Overview

This workshop aims to provide an opportunity to share practice and experience with high-performance computing systems based on the Arm architecture, as well as their performance and applications. The last few years have seen an explosion of 64-bit Arm-based processors targeted at server and infrastructure workloads, often specialized for a specific domain such as HPC, cloud, or machine learning. Fujitsu's A64FX and Marvell's ThunderX2 have been used in several large-scale HPC systems, and Amazon's Graviton2 has been adopted in Amazon EC2. Moreover, Amazon's Graviton3, the NVIDIA Grace CPU Superchip, and SiPearl's Rhea system-on-chip have recently been announced or become available. Sharing practice and experience with these Arm-based processors will contribute to advancing high-performance computing technology for newly designed systems built on these emerging processors.

Program

    
   
09:00-09:10   Miwako Tsuji: Opening Remarks
09:10-10:00   Koichi Shirahata: Invited Talk: Fugaku-LLM: A Large Language Model Trained on the Supercomputer Fugaku (slides)
Abstract: While not initially designed for large-scale deep learning models like Large Language Models (LLMs), Japan's flagship supercomputer, Fugaku, provided a unique opportunity. This work details the optimization of deep learning frameworks for distributed parallel execution on Fugaku, creating a high-performance computing environment for LLM training. A novel LLM was trained from scratch using a large dataset primarily focused on Japanese text.
10:00-10:30   Shinji Sumimoto, Takashi Arakawa, Yoshio Sakaguchi, Hiroya Matsuba, Satoshi Ohshima, Hisashi Yashiro, Toshihiro Hanawa, Kengo Nakajima: Accelerating Heterogeneous Coupling Computing with WaitIO Using RDMA
Abstract: In this paper, we propose the communication libraries WaitIO-Verbs and WaitIO-Tofu, which use RDMA communication to speed up communication performance in the h3-OpenSYS/WaitIO (WaitIO) library, which can connect multiple MPI programs across multiple heterogeneous systems. It is important to use industry-standard communication methods for communication between heterogeneous systems, and WaitIO therefore implements WaitIO-Socket and WaitIO-File, which use POSIX-based specifications for socket and file I/O. However, since POSIX interfaces generally rely on system calls, sufficient performance may not be obtained depending on the system. Therefore, to further speed up communication, we implemented WaitIO-Verbs and WaitIO-Tofu on top of user-level RDMA, using industry-standard or system-default communication interfaces. Our implementation and evaluation show that they achieve high communication and application performance. WaitIO achieved high application performance even between multiple heterogeneous clusters, which MPI could not achieve.
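The syscall-overhead argument in this abstract can be illustrated with a toy microbenchmark. The sketch below is generic C, not WaitIO code: it only times the kernel round trip that each POSIX socket transfer pays via write(2)/read(2), which user-level RDMA interfaces avoid by posting work from user space. The message size and iteration count are arbitrary choices.

    /* Toy microbenchmark: syscall-based message transfer on a socketpair.
     * Not WaitIO code; it only illustrates that each POSIX send/receive
     * pays for a kernel round trip, which user-level RDMA avoids. */
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>
    #include <sys/socket.h>

    int main(void) {
        int sv[2];
        char buf[4096] = {0};            /* arbitrary 4 KiB message */
        const int iters = 100000;        /* arbitrary repetition count */

        if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) {
            perror("socketpair");
            return 1;
        }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < iters; i++) {
            /* one write(2) syscall per message ... */
            if (write(sv[0], buf, sizeof buf) != (ssize_t)sizeof buf)
                return 1;
            /* ... and read(2) syscalls until the whole message arrives */
            ssize_t got = 0;
            while (got < (ssize_t)sizeof buf) {
                ssize_t r = read(sv[1], buf + got, sizeof buf - got);
                if (r <= 0)
                    return 1;
                got += r;
            }
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9
                  + (double)(t1.tv_nsec - t0.tv_nsec);
        printf("avg syscall round trip per message: %.0f ns\n", ns / iters);
        return 0;
    }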
10:30-11:00   Break
11:00-11:30   Oscar Hernandez, Thomas Wang, Wael Elwasif, Filippo Spiga, Francesca Tartaglione, Markus Eisenbach, Ross Miller: Preliminary Study on Fine-Grained Power and Energy Measurements on Grace Hopper GH200 with Open-Source Performance Tools
Abstract: The increasing adoption of tightly integrated, heterogeneous architectures, combined with the slowdown of Moore’s law, has made application power and energy-driven optimizations critical to efficiently use high-performance computing systems. This paper introduces a newly developed open-source toolkit that seamlessly integrates the Linux real-time hardware monitoring program hwmon with the Performance Application Programming Interface and the Score-P performance measurement system, thereby enabling fine-grained power and energy measurements for high-performance computing applications. Our primary target platform is the Wombat test bed, which is a system based on the NVIDIA GH200 superchip. The toolkit can capture transient power peaks with high temporal resolution (50 ms) and, thanks to Score-P integration, can map power metrics to specific code regions, thereby providing actionable information on power-intensive operations and inefficiencies. The toolkit also provides a holistic view of both the power and the energy consumption of the entire GH200 superchip by covering all major components: the Grace CPU, the Hopper GPU, and the I/O subsystem. Experiments that use Locally Self-consistent Multiple Scattering, which is an application for first-principles calculations of materials developed at Oak Ridge National Laboratory, have demonstrated the tool’s ability to identify transient power spikes and uncover opportunities for energy-aware optimizations. Additionally, we introduce a Python-based utility for converting Open Trace Format 2 traces to Parquet format, thus enabling advanced data analysis for numerical integration methods applied to power data for accurate energy profiling.
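As a rough illustration of the hwmon interface this toolkit builds on, a minimal sketch follows; it is not the toolkit itself. Linux hwmon exposes power sensors as sysfs files reporting microwatts; the sensor path below is a placeholder that varies by platform, and the 50 ms sampling period mirrors the resolution quoted above.

    /* Minimal sketch: sample a Linux hwmon power sensor at ~50 ms and
     * integrate energy with the trapezoidal rule. The sysfs path is a
     * placeholder; real sensor names vary by platform and should be
     * discovered by enumerating /sys/class/hwmon/hwmon*. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* hypothetical sensor path, for illustration only */
    #define SENSOR "/sys/class/hwmon/hwmon0/power1_input"

    static double read_watts(void) {
        FILE *f = fopen(SENSOR, "r");
        long uw = 0;                 /* hwmon power values are microwatts */
        if (!f) { perror(SENSOR); exit(1); }
        if (fscanf(f, "%ld", &uw) != 1) { fclose(f); exit(1); }
        fclose(f);
        return uw / 1e6;
    }

    int main(void) {
        const double dt = 0.050;     /* 50 ms sampling period */
        const int samples = 200;     /* ~10 s of monitoring */
        double prev = read_watts(), energy_j = 0.0;

        for (int i = 0; i < samples; i++) {
            usleep(50000);           /* approximate; ignores read latency */
            double cur = read_watts();
            energy_j += 0.5 * (prev + cur) * dt;  /* trapezoidal rule */
            prev = cur;
        }
        printf("estimated energy over %.1f s: %.2f J\n",
               samples * dt, energy_j);
        return 0;
    }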
11:30-12:00   David Carlson, Nikolay Simakov, Rodrigo Ristow Hadlich, Anthony Curtis, Joshua Martin, Gaurav Verma, Smeet Chheda, Firat Coskun, Raul Gonzalez, Daniel Wood, Feng Zhang, Robert Harrison, Eva Siegmann: The AmpereOne A192-32X in Perspective: Benchmarking a New Standard
Abstract: This study presents a comprehensive benchmarking analysis of the Arm-based AmpereOne A192-32X CPU, a high-performance, low-power processor designed for cloud-native workloads characterized by high core occupancy, imperfectly vectorized or even purely scalar software, limited need for high floating-point performance, and, increasingly, AI inference. These traits also characterize much of academic research computing. Hence, a thorough investigation of this novel CPU, seeking to characterize its strengths and weaknesses on academic workloads, including traditional HPC codes for which it was not designed, will shed light on its relevance in a research setting. We report comparative analyses with contemporary CPUs (Intel Sapphire Rapids, AMD EPYC, NVIDIA Grace-Grace) and illustrate AmpereOne’s architectural advantages in handling parallel workloads and optimizing power consumption. The CPUs are compared in terms of performance and power consumption using a wide range of applications covering different workloads and disciplines.
12:00-12:30   Xuanzhengbo Ren, Yuta Kawai, Hirofumi Tomita, Seiya Nishizawa, Takahiro Katagiri, Tetsuya Hoshino, Daichi Mukunoki, Masatoshi Kawai, Toru Nagai: Performance Evaluation of Loop Body Splitting for Fast Modal Filtering in SCALE-DG on A64FX
Abstract: Modern general-purpose central processing units (CPUs) benefit from the integration of Single Instruction, Multiple Data (SIMD) architectures. The Scalable Vector Extension (SVE) is one of Arm’s SIMD architectures designed for HPC. Fujitsu’s A64FX is the first Arm-based processor to incorporate hardware-implemented SVE alongside High-Bandwidth Memory 2 (HBM2). However, in the A64FX, the high latencies of SIMD instructions, such as fused multiply-add (FMA), combined with the limited capacity of reservation stations, can result in inefficient Out-of-Order (OoO) execution. SCALE-DG is an atmospheric dynamical core that uses the discontinuous Galerkin method (DGM). Modal filtering, an essential procedure in SCALE-DG, has an optimized version called fast modal filtering, which suffers from the OoO execution issue due to dot products with long vector lengths in the loop body. This issue can be alleviated by splitting the loop body into multiple parts. However, since the vector length of the dot product varies with the polynomial order (P) in SCALE-DG, it remains unclear how the number of splits affects the performance of fast modal filtering. In this paper, we present an evaluation across P values in the range of 3 to 11 with different splitting numbers and combinations. The results indicate that when P ≤ 7, performance degraded in most cases, with only a few cases achieving positive speedups (1.01x to 1.02x) after splitting the loop body. For 8 ≤ P ≤ 11, splitting had a consistently positive impact. The 3-way splitting was identified as the optimal configuration, achieving speedups of 1.15x to 1.26x.
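To make the loop-body-splitting idea concrete in generic terms, a schematic sketch follows; it is not SCALE-DG source code, and the array names and vector length n are illustrative. A single-accumulator dot product serializes every FMA through one register, so the FMA latency bounds throughput; splitting the body into independent partial sums, as in the 3-way variant reported as optimal above, gives the out-of-order engine independent chains to interleave.

    /* Schematic illustration of loop body splitting for a dot product.
     * Not SCALE-DG source: just the generic technique the paper evaluates. */

    /* baseline: every FMA waits on the previous one through 'sum',
     * forming one long latency-bound dependency chain */
    double dot_single(const double *a, const double *b, int n) {
        double sum = 0.0;
        for (int i = 0; i < n; i++)
            sum += a[i] * b[i];
        return sum;
    }

    /* 3-way split: three independent accumulator chains (the
     * configuration reported as optimal above), combined after the loop */
    double dot_split3(const double *a, const double *b, int n) {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0;
        int i = 0;
        for (; i + 2 < n; i += 3) {
            s0 += a[i]     * b[i];
            s1 += a[i + 1] * b[i + 1];
            s2 += a[i + 2] * b[i + 2];
        }
        for (; i < n; i++)          /* remainder iterations */
            s0 += a[i] * b[i];
        return s0 + s1 + s2;
    }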


Topics

In particular, this workshop will focus on the following topics of interest:
   - HPC Applications
   - Performance Analysis, Performance Modeling & Measurement
   - SVE Vectorization Analysis
   - Programming Models & System Software
   - Networking and Accelerators such as GPUs
   - Artificial Intelligence and Machine Learning
   - Emerging Technologies

Paper Submissions

All papers must be original and not simultaneously submitted to another journal or conference. The following paper categories are welcome:
   - Full papers: manuscripts must be at most 18 pages in one-column submission PDF format, including figures and references
   - Short papers: manuscripts must be at most 10 pages in one-column submission PDF format, including figures and references
The paper format is described in the Paper Submission section of HPC Asia 2025.
Please note that the paper format for submission (one-column) differs from that of the camera-ready version (two-column). For more details, please refer to https://www.acm.org/publications/authors/submissions
All submissions will be peer-reviewed by the program committee members, and accepted papers will be presented at the workshop. The review process is double-blind, so please do NOT include the authors' names or other identifying information in your submission.

Important Dates

   - Full Paper Submission (via EasyChair): 23rd Dec 2024
   - Notification: 12th Jan 2025
   - Camera ready: 15th Jan 2025

Submission Site

https://easychair.org/conferences/?conf=iwahpce2025

Registration and Open Access Fee

   - At least one author must register for the conference
   - Based on ACM's new Open Access publication policy, ACM will contact the authors directly regarding the publication fee and collect it from them.
   - For more details about the ACM Open Access policy and fee, please refer here and here.
   - However, if your institution is a member of the ACM Open program, the paper will be published at no charge, as indicated on the website. The list of participating institutions in ACM Open can be viewed at https://libraries.acm.org/acmopen/open-participants.

Organizers and Program Committee

Organizers and Workshop Chairs
Miwako Tsuji, RIKEN R-CCS
Eva Siegmann, Stony Brook University
Filippo Spiga, NVIDIA

Program Committee
Conrad Hillairet   Arm
Csaba Csoma   AWS
Estela Suarez   JSC
Eva Siegmann  Stony Brook University
Fabio Banchelli   BSC
Filippo Spiga  NVIDIA
Gilles Fourestey   EPFL
Jens Domke   RIKEN R-CCS
John Cazes   TACC
Luca Fedeli   CEA
Min Li   Huawei
Miwako Tsuji   RIKEN R-CCS
Tetsuya Odajima   Fujitsu
Wael Elwasif   ORNL
Yuetsu Kodama   RIKEN R-CCS