International Workshop on Arm-based HPC: Practice and Experience (IWAHPCE-2025)
to be held in conjunction with The International Conference on High Performance Computing in Asia-Pacific Region (HPC Asia 2025), Hsinchu, Taiwan, Feb. 19-21, 2025.
Workshop Overview
This workshop aims to provide an opportunity to share practice and experience with high-performance computing systems based on the Arm architecture, together with their performance and applications. The last few years have seen an explosion of 64-bit Arm-based processors targeted toward server and infrastructure workloads, often specialized for a particular domain such as HPC, cloud, or machine learning. Fujitsu’s A64FX and Marvell’s ThunderX2 have been used in several large-scale HPC systems, and Amazon’s Graviton2 has been adopted by Amazon EC2. Moreover, Amazon’s Graviton3, the NVIDIA Grace CPU Superchip, and SiPearl’s Rhea system-on-chip have recently been announced or become accessible. Sharing practice and experience with these Arm-based processors will contribute to advancing high-performance computing technology for newly designed systems built on these emerging processors.
Program
09:00-09:10 | Miwako Tsuji | Opening Remarks |
09:10-10:00 | Koichi Shirahata | Invited Talk: Fugaku-LLM: A Large Language Model Trained on the Supercomputer Fugaku (slides)
Abstract:
While not initially designed for large-scale deep learning models like Large Language Models (LLMs), Japan's flagship supercomputer, Fugaku, provided a unique opportunity. This work details the optimization of deep learning frameworks for distributed parallel execution on Fugaku, creating a high-performance computing environment for LLM training. A novel LLM was trained from scratch using a large dataset primarily focused on Japanese text.
|
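For context, the core of distributed data-parallel training, which the talk's framework optimizations target, is averaging gradients across nodes at every step. Below is a minimal illustrative C/MPI sketch of that step; it is our illustration, not the Fugaku-LLM code, and NPARAMS and the plain MPI_Allreduce are simplifying assumptions, since real frameworks fuse, overlap, and compress this communication.

```c
/* Illustrative sketch of the gradient-averaging step in data-parallel
 * training: each rank computes gradients on its own data shard, then
 * all ranks reduce them to the global mean every iteration.
 * Not the Fugaku-LLM implementation. Compile with: mpicc allreduce.c */
#include <mpi.h>
#include <stdio.h>

#define NPARAMS 1024  /* hypothetical stand-in for the parameter count */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    float grad[NPARAMS];
    for (int i = 0; i < NPARAMS; i++)
        grad[i] = (float)rank;  /* pretend local gradients */

    /* Sum gradients across all ranks in place, then divide to average. */
    MPI_Allreduce(MPI_IN_PLACE, grad, NPARAMS, MPI_FLOAT,
                  MPI_SUM, MPI_COMM_WORLD);
    for (int i = 0; i < NPARAMS; i++)
        grad[i] /= (float)size;

    if (rank == 0)
        printf("averaged grad[0] = %f\n", grad[0]);
    MPI_Finalize();
    return 0;
}
```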
10:00-10:30 | Shinji Sumimoto, Takashi Arakawa, Yoshio Sakaguchi, Hiroya Matsuba, Satoshi Ohshima, Hisashi Yashiro, Toshihiro Hanawa, Kengo Nakajima | Accelerating Heterogeneous Coupling Computing with WaitIO Using RDMA
Abstract:
In this paper, we propose the communication libraries WaitIO-Verbs and WaitIO-Tofu, which use RDMA to speed up communication in the h3-OpenSYS/WaitIO (WaitIO) library, a library that can connect multiple MPI programs across multiple heterogeneous systems. It is important to use industry-standard communication methods between heterogeneous systems, and WaitIO therefore implements WaitIO-Socket and WaitIO-File, which use POSIX-based specifications for socket and file I/O. However, since POSIX interfaces generally go through system calls, sufficient performance may not be obtained on some systems. To further speed up communication, we therefore implemented WaitIO-Verbs and WaitIO-Tofu on top of user-level RDMA, using industry-standard or system-default communication specifications. Our implementation and evaluation show high communication and application performance: WaitIO achieved high application performance even across multiple heterogeneous clusters, which MPI could not achieve.
|
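For context on what "user-level RDMA" buys here: with libibverbs, buffers are registered with the NIC once, and data then moves via work requests posted entirely from user space, avoiding the per-message system calls of POSIX sockets. Below is a minimal, hedged sketch of that setup; it is illustrative only and does not show WaitIO-Verbs' actual API or internals.

```c
/* Minimal sketch of the user-level RDMA (libibverbs) setup that a
 * Verbs-based transport builds on. Illustrative only; WaitIO's actual
 * design differs. Compile with: gcc rdma_sketch.c -libverbs */
#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void) {
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) { fprintf(stderr, "no RDMA devices\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    if (!ctx) { fprintf(stderr, "cannot open device\n"); return 1; }
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    /* Register a buffer so the NIC can read/write it directly. */
    size_t len = 1 << 20;
    void *buf = malloc(len);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
        IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ |
        IBV_ACCESS_REMOTE_WRITE);
    if (!mr) { perror("ibv_reg_mr"); return 1; }
    printf("registered %zu bytes, lkey=0x%x rkey=0x%x\n",
           len, mr->lkey, mr->rkey);

    /* A real transport would now create queue pairs, exchange keys with
     * the peer, and post send/RDMA-write work requests (ibv_post_send)
     * entirely in user space -- no system call per message. Omitted. */
    ibv_dereg_mr(mr); free(buf);
    ibv_dealloc_pd(pd); ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```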
10:30-11:00 | | Break |
11:00-11:30 | Oscar Hernandez, Thomas Wang, Wael Elwasif, Filippo Spiga, Francesca Tartaglione, Markus Eisenbach, Ross Miller | Preliminary Study on Fine-Grained Power and Energy Measurements on Grace Hopper GH200 with Open-Source Performance Tools
Abstract:
The increasing adoption of tightly integrated, heterogeneous architectures, combined with the slowdown of Moore’s law, has made application power and energy-driven optimizations critical to efficiently use high-performance computing systems. This paper introduces a newly developed open-source toolkit that seamlessly integrates the Linux real-time hardware monitoring program hwmon with the Performance Application Programming Interface and the Score-P performance measurement system, thereby enabling fine-grained power and energy measurements for high-performance computing applications. Our primary target platform is the Wombat test bed, which is a system based on the NVIDIA GH200 superchip. The toolkit can capture transient power peaks with high temporal resolution (50 ms) and, thanks to Score-P integration, can map power metrics to specific code regions, thereby providing actionable information on power-intensive operations and inefficiencies. The toolkit also provides a holistic view of both the power and the energy consumption of the entire GH200 superchip by covering all major components: the Grace CPU, the Hopper GPU, and the I/O subsystem. Experiments that use Locally Self-consistent Multiple Scattering, which is an application for first-principles calculations of materials developed at Oak Ridge National Laboratory, have demonstrated the tool’s ability to identify transient power spikes and uncover opportunities for energy-aware optimizations. Additionally, we introduce a Python-based utility for converting Open Trace Format 2 traces to Parquet format, thus enabling advanced data analysis for numerical integration methods applied to power data for accurate energy profiling.
|
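For readers unfamiliar with hwmon: the kernel exposes each sensor as a sysfs file, and power attributes report values in microwatts. Below is a minimal sketch of polling one such sensor at the 50 ms resolution the paper mentions; the path is a hypothetical example, since real sensor names and paths vary by platform and must be discovered by scanning /sys/class/hwmon. This is our illustration, not the paper's toolkit, which integrates these readings with PAPI and Score-P.

```c
/* Minimal sketch of polling a Linux hwmon power sensor via sysfs.
 * hwmon power*_average / power*_input attributes report microwatts.
 * The path below is a hypothetical example; enumerate
 * /sys/class/hwmon/hwmon* and read each "name" file to find real ones. */
#include <stdio.h>
#include <unistd.h>

int main(void) {
    const char *path = "/sys/class/hwmon/hwmon0/power1_average";

    for (int i = 0; i < 10; i++) {        /* 10 samples, 50 ms apart */
        FILE *f = fopen(path, "r");
        if (!f) { perror(path); return 1; }
        long uw = 0;
        if (fscanf(f, "%ld", &uw) != 1)   /* value in microwatts */
            fprintf(stderr, "bad reading\n");
        fclose(f);
        printf("%.3f W\n", uw / 1e6);
        usleep(50 * 1000);                /* ~50 ms sampling interval */
    }
    return 0;
}
```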
11:30-12:00 | David Carlson, Nikolay Simakov, Rodrigo Ristow Hadlich, Anthony Curtis, Joshua Martin, Gaurav Verma, Smeet Chheda, Firat Coskun, Raul Gonzalez, Daniel Wood, Feng Zhang, Robert Harrison, Eva Siegmann | The AmpereOne A192-32X in Perspective: Benchmarking a New Standard
Abstract:
This study presents a comprehensive benchmarking analysis of the Arm-based AmpereOne A192-32X CPU, a high-performance, low-power processor designed for cloud-native workloads characterized by high core occupancy, imperfectly vectorized or even purely scalar software, limited need for high floating-point performance, and, increasingly, AI inference. These traits also characterize much of academic research computing. A thorough investigation of this novel CPU that characterizes its strengths and weaknesses on academic workloads, including traditional HPC codes for which it was not designed, will therefore shed light on its relevance in a research setting. We report comparative analyses with contemporary CPUs (Intel Sapphire Rapids, AMD EPYC, NVIDIA Grace-Grace) and illustrate AmpereOne’s architectural advantages in handling parallel workloads and optimizing power consumption. The CPUs are compared in terms of performance and power consumption using a wide range of applications covering different workloads and disciplines.
|
12:00-12:30 | Xuanzhengbo Ren, Yuta Kawai, Hirofumi Tomita, Seiya Nishizawa, Takahiro Katagiri, Tetsuya Hoshino, Daichi Mukunoki, Masatoshi Kawai, Toru Nagai | Performance Evaluation of Loop Body Splitting for Fast Modal Filtering in SCALE-DG on A64FX
Abstract:
Modern general-purpose central processing units (CPUs) benefit from the integration of Single Instruction, Multiple Data (SIMD) architectures. The Scalable Vector Extension (SVE) is one of Arm’s SIMD architectures designed for HPC, and Fujitsu’s A64FX is the first Arm-based processor to incorporate hardware-implemented SVE alongside High-Bandwidth Memory 2 (HBM2). However, in the A64FX, the high latencies of SIMD instructions such as fused multiply-add (FMA), combined with the limited capacity of the reservation stations, can result in inefficient Out-of-Order (OoO) execution. SCALE-DG is an atmospheric dynamical core that uses the discontinuous Galerkin method (DGM). Modal filtering, an essential procedure in SCALE-DG, has an optimized version called fast modal filtering, which suffers from this OoO execution issue due to dot products with long vector lengths in the loop body. The issue can be alleviated by splitting the loop body into multiple parts. However, since the vector length of the dot product varies with the polynomial order (𝑃) in SCALE-DG, it remains unclear how the number of splits affects the performance of fast modal filtering. In this paper, we present an evaluation across 𝑃 values in the range of 3 to 11 with different splitting numbers and combinations. The results indicate that for 𝑃 ≤ 7, performance degraded in most cases, with only a few cases achieving positive speedups (1.01x to 1.02x) after splitting the loop body. For 8 ≤ 𝑃 ≤ 11, splitting had a consistently positive impact, and 3-way splitting was identified as the optimal configuration, achieving speedups of 1.15x to 1.26x.
|
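To make the technique concrete: a dot product with a single accumulator forms one long FMA dependency chain, and the A64FX's high FMA latency then stalls OoO execution while in-flight instructions pile up in the reservation stations. Splitting the accumulation into independent partial sums shortens each chain. Below is a minimal sketch of one common form of this transformation; it is our illustration of the general idea, not the SCALE-DG code.

```c
/* Sketch of loop-body splitting for a long dot product. */

/* Baseline: one dependency chain; each FMA waits on the previous one. */
double dot(const double *a, const double *b, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}

/* 3-way split: three independent chains the OoO core can interleave,
 * combined only at the end. */
double dot_split3(const double *a, const double *b, int n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0;
    int i = 0;
    for (; i + 2 < n; i += 3) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
    }
    for (; i < n; i++)             /* remainder iterations */
        s0 += a[i] * b[i];
    return s0 + s1 + s2;
}
```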
Topics
In particular, this workshop will focus on the following topics of interest:
- HPC Applications
- Performance Analysis, Performance Modeling & Measurement
- SVE Vectorization Analysis
- Programming Models & System Software
- Networking and accelerators such as GPUs
- Artificial Intelligence and Machine Learning
- Emerging Technologies
Paper Submissions
All papers must be original and not simultaneously submitted to another journal or conference. The following paper categories are welcome:
- Full papers: manuscripts must be at most 18 pages in one-column submission PDF format, including figures and references
- Short papers: manuscripts must be at most 10 pages in one-column submission PDF format, including figures and references
The paper format is described in the Paper Submission section of HPCAsia2025.
Please note that the paper format for submission (one-column) differs from that of the camera-ready version (two-column). For more details, please refer to https://www.acm.org/publications/authors/submissions
All submissions will be peer-reviewed by the PC members, and accepted papers will be presented at the workshop. The review process is double-blind, so please do NOT include author names or any other identifying information in your submission.
Important Dates
- Full Paper Submission (via EasyChair): 23rd Dec 2024
- Notification: 12th Jan 2025
- Camera ready: 15th Jan 2025
Submission Site
https://easychair.org/conferences/?conf=iwahpce2025
Registration and Open Access Fee
- At least one author must register for the conference
- Under ACM's new Open Access publication policy, ACM will contact the authors directly regarding the publication fee and collect it from them.
- For more details about the ACM Open Access policy and fee, please refer here and here.
- However, if your institution is a member of the ACM Open program, the paper will be published at no charge, as indicated on the web site. The list of participating institutions in ACM Open can be viewed at https://libraries.acm.org/acmopen/open-participants.
Organizers and Program Committee
Organizer and Workshop Chair
Miwako Tsuji, RIKEN R-CCS
Eva Siegmann, Stony Brook University
Filippo Spiga, NVIDIA
Program Committee
Conrad Hillairet Arm
Csaba Csoma AWS
Estela Suarez JSC
Eva Siegmann Stony Brook University
Fabio Banchelli BSC
Filippo Spiga NVIDIA
Gilles Fourestey EPFL
Jens Domke RIKEN R-CCS
John Cazes TACC
Luca Fedeli CEA
Min Li Huawei
Miwako Tsuji RIKEN R-CCS
Tetsuya Odajima Fujitsu
Wael Elwasif ORNL
Yuetsu Kodama RIKEN R-CCS