# LUMI-G Early Access Platform
Warning
This page is a preliminary version of the guide documenting the use of the LUMI-G Early Access Platform (EAP). It contains information specific to the EAP but does not stand on its own. Users of the EAP are also invited to read the other sections of the LUMI documentation, in particular the sections on the module system and the programming environment.
The LUMI early access platform consists of nodes with MI100 GPUs, the predecessors of the MI250X GPUs that will be in LUMI-G. Its intended use is to give users access to the software stack so that they can prepare their software for the LUMI-G nodes before they arrive. The platform consists of 14 nodes (with 4 available initially), summarized in the table below:
| Nodes | CPUs | Memory | GPUs | Local storage | Network |
|---|---|---|---|---|---|
| 14 | 1x 64-core AMD EPYC 7662 (2.0 GHz base, 3.3 GHz boost) | 512 GB | 4x AMD MI100 | 2x 3 TB NVMe | 1x 100 Gb/s |
The MI100 GPU is the predecessor of the MI250X that will be in LUMI-G. Because the MI100 is a previous-generation GPU, direct comparisons with the GPUs that LUMI-G will use are not straightforward. Based on the number of compute units it may seem like a good first approximation that one MI100 is about on par with one of the two dies in an MI250X, but that does not take into account the MI250X's increased FP64 and DGEMM performance, the performance achievable with packed FP32, the increased bfloat16 performance, or the increased memory bandwidth.
In the EAP nodes the four MI100 GPUs are connected to each other with direct Infinity Fabric links, each providing 46 GB/s of bandwidth in each direction. Each of the MI250X GPUs in LUMI-G will show up as two devices, and the Infinity Fabric mesh in those nodes is more complex with more links, but the basic idea is the same: direct GPU-to-GPU links. The one thing the EAP nodes cannot emulate is the fast Infinity Fabric link between the two dies on one MI250X board.
The network in the EAP nodes is significantly different from what the LUMI-G nodes will have: the EAP nodes have a single 100 Gbit/s network adapter, whereas the LUMI-G nodes will have 4x 200 Gbit/s adapters. Any performance achieved by multi-node runs is therefore not at all representative of the LUMI-G nodes, and multi-node runs should only be done to test functionality.
## About the programming environment
The programming environment of the EAP is still experimental and does not entirely reflect the final environment, or its performance, that will be available once LUMI-G arrives. The main characteristics of the current platform are:
- OpenMP offload support is available for C/C++ and Fortran with the Cray compilers (`PrgEnv-cray`). The same is true for the AMD compilers. Currently the AMD compilers are available by loading the `rocm` module; on the final system they will be available through a `PrgEnv-amd` module.
- HIP code can be compiled with the Cray C++ compiler wrapper (`CC`) or with the AMD `hipcc` compiler wrapper.
- A GPU-aware MPI implementation is available (by loading the `cray-mpich` module). You can use this MPI library with both the Cray and AMD environments. Note that during testing the LUMI User Support Team noticed that the inter-node bandwidth is low (~4.5 GB/s). On the final system it will be possible to initiate communication directly from the GPU (MPI calls from HIP kernels); this feature is not available on the Early Access Platform.
## Compiling HIP code
You have two options to compile your HIP code: the ROCm AMD compiler or the Cray compiler.
To compile HIP code with the Cray C/C++ compiler, load the following modules in your environment.
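A plausible module selection, mirroring the HIP+MPI section later on this page, is sketched below; adjust it to the versions installed on the EAP:

```bash
# Assumed module environment for HIP compilation with the Cray wrappers
# (mirrors the HIP+MPI section below, without cray-mpich)
module load craype-accel-amd-gfx908  # make the wrappers target the MI100 (gfx908)
module load rocm                     # provides the HIP headers, runtime and tools
```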
The compilation flags to use to compile HIP code with the Cray C++ compiler wrapper (`CC`) are the following.
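A minimal sketch of a compile command, assuming the standard clang-style HIP options (`-x hip` plus the offload architecture flag mentioned in the warning below); the source file name is hypothetical:

```bash
# Compile a HIP source file with the Cray C++ wrapper
CC -x hip --offload-arch=gfx908 -c my_hip_code.cpp -o my_hip_code.o
```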
In addition, at the linking step, you need to link your application with the HIP runtime library using flags such as the following.
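A hedged sketch, assuming the `rocm` module sets `${ROCM_PATH}` and that the HIP runtime is the standard `libamdhip64` library; check the paths on the system:

```bash
# Link step: point the linker at the ROCm installation and the HIP runtime
CC my_hip_code.o -L${ROCM_PATH}/lib -lamdhip64 -o my_hip_app
```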
Warning
Be careful and make sure to compile your code using the `--offload-arch=gfx908` flag in order to produce code optimized for the MI100 GPU architecture. If you omit the flag, the compiler may optimize your code for an older, less suitable architecture.
## Compiling OpenMP offload code
You have two options for compiling OpenMP offload code: the first is to use the Cray compilers, the second is to use the AMD compilers provided with ROCm.
To compile an OpenMP offload code with the Cray compilers, you need to load the following modules:
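The module list is sketched below under the assumption that it matches the one used for HIP+MPI later on this page; adjust it to what is installed on the EAP:

```bash
# Assumed module environment for OpenMP offload with the Cray compilers
module load craype-accel-amd-gfx908  # target the MI100 (gfx908) GPUs
module load rocm                     # GPU runtime used by the offload support
```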
It is critical to load the `craype-accel-amd-gfx908` module in order to make the compiler wrappers aware that you target the MI100 GPUs. To compile the code, use the Cray compiler wrappers: `cc` (C), `CC` (C++) and `ftn` (Fortran) with the `-fopenmp` flag, as in the sketch below.
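For example, with hypothetical source file names:

```bash
# Build OpenMP offload code with the Cray compiler wrappers
cc  -fopenmp offload_code.c   -o offload_code_c
CC  -fopenmp offload_code.cpp -o offload_code_cpp
ftn -fopenmp offload_code.f90 -o offload_code_f90
```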
The AMD compilers are available by loading the `rocm` module. It gives you access to the `amdclang` (C), `amdclang++` (C++) and `amdflang` (Fortran) compilers. In order to compile OpenMP offload code, you need to pass additional target flags to the compiler.
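A hedged sketch of those flags, using the explicit offload-target form accepted by the ROCm Clang-based compilers (recent ROCm releases also accept the shorter `-fopenmp --offload-arch=gfx908` spelling); the source file name is hypothetical:

```bash
# OpenMP offload with amdclang, targeting the MI100 (gfx908)
amdclang -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa \
         -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx908 \
         offload_code.c -o offload_code
```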
## Compiling a HIP+MPI code
The MPI implementation available on LUMI is GPU-aware: you can pass a pointer to memory allocated on the GPU to MPI calls. This MPI implementation is used by loading the `cray-mpich` module.
```bash
module load craype-accel-amd-gfx908
module load cray-mpich
module load rocm
export MPICH_GPU_SUPPORT_ENABLED=1
```
The Cray compiler wrappers will automatically link your application to the MPI library and the Cray GPU transfer library (`libmpi_gtl`), acting similarly to `mpicc`. You still need to use the flags presented in the previous section in order to compile HIP code, for example:
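A minimal sketch with a hypothetical file name; the wrapper adds the MPI and GTL libraries, while the HIP flags are the same as before:

```bash
# Compile and link a HIP+MPI code with the Cray C++ wrapper
CC -x hip --offload-arch=gfx908 hip_mpi_code.cpp -o hip_mpi_app
```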
Warning
When running GPU-aware MPI code, you need to enable GPU support by setting `MPICH_GPU_SUPPORT_ENABLED=1`. Failing to do so will lead to a failure if your application uses GPU pointers in MPI calls; this usually manifests as a bus error. Note also that Cray MPI does not support GPU-aware MPI with multiple GPUs per rank, i.e. you should use only one GPU per MPI rank.
## Submitting jobs
LUMI uses Slurm as its job scheduler. If you are not familiar with Slurm, please read the Slurm quick start guide.
To submit jobs to the Early Access Platform you need to select the `eap` partition and provide your project number. Below is an example job script that launches an application with 2 MPI ranks, 8 threads and 1 GPU per rank.
```bash
#!/bin/bash
#SBATCH --partition=eap                # (1)
#SBATCH --account=<project_XXXXXXXXX>  # (2)
#SBATCH --time=10:00                   # (3)
#SBATCH --ntasks=2                     # (4)
#SBATCH --cpus-per-task=8              # (5)
#SBATCH --gpus-per-task=1              # (6)
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK  # (7)
export MPICH_GPU_SUPPORT_ENABLED=1           # (8)
srun <executable>                            # (9)
```
1. Select the Early Access Platform partition.
2. Change this value to match your project number. If you don't know your project number, use the `groups` command: you should see that you are a member of a group looking like `project_XXXXXXXXX`.
3. The format for time is `dd-hh:mm:ss`. In this case, the requested time is 10 minutes.
4. Request 2 tasks. A task, in Slurm speak, is a process. If your application uses MPI, this corresponds to the number of MPI ranks.
5. Request 8 threads per task. If your application is multithreaded (for example, using OpenMP), this is how you control the number of threads.
6. Request 2 GPUs for this job, one for each task. Most of the time the number of GPUs is the same as the number of tasks (MPI ranks).
7. If your application is multithreaded with OpenMP, set `OMP_NUM_THREADS` to the value set with `--cpus-per-task`.
8. Enable GPU support in MPI if your code needs a GPU-aware MPI.
9. Launch your application with `srun`. There is no `mpirun`/`mpiexec` on LUMI; you should always use `srun` to launch MPI applications. If your application doesn't use MPI, you can omit `srun`.
Once your job script is ready, you can submit your job using the `sbatch` command, for example:
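Here `job.sh` is a placeholder for whatever your script is called:

```bash
sbatch job.sh
```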
More information about the available batch script parameters can be found here. The table below summarizes the GPU-specific options.
| Option | Description |
|---|---|
| `--gpus` | Specify the total number of GPUs across all nodes |
| `--gpus-per-task` | Specify the number of GPUs required for each task |
| `--gpus-per-node` | Specify the number of GPUs per node |
| `--ntasks-per-gpu` | Specify the number of tasks for every GPU |
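As a hedged sketch (not taken from the example above), the same resources could also be requested with the node-oriented options from the table:

```bash
# Alternative way to ask for 2 tasks and 2 GPUs on a single node
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --gpus-per-node=2
```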