# PyTorch on LUMI
PyTorch is an open source Python package that provides tensor computation, like NumPy, with GPU acceleration and deep neural networks built on a tape-based autograd system.
PyTorch can be installed by users following the instructions on PyTorch's web page. The options to choose for LUMI in the interactive selector are `Linux`, `Pip`, and `ROCm 5.x`. For installing with `pip`, the `cray-python` module should be loaded. The PyTorch wheels bundle the ROCm libraries needed for GPU support, so even if that particular ROCm version is not available on LUMI, PyTorch may still be able to use the GPUs.
PyTorch can also be run within containers, in particular containers built from the images provided by AMD on Docker Hub. Those images are updated frequently and make it possible to try PyTorch with recent ROCm versions. Another point in favor of using containers is that PyTorch's installation directory can be quite large, both in terms of storage size and number of files.
## Running PyTorch within containers
We recommend using container images from `rocm/pytorch` or `rocm/deepspeed`.
The images can be fetched with singularity:
```bash
SINGULARITY_TMPDIR=$SCRATCH/tmp-singularity singularity pull docker://rocm/pytorch:rocm5.4.1_ubuntu20.04_py3.7_pytorch_1.12.1
```
This leaves the image file `pytorch_rocm5.4.1_ubuntu20.04_py3.7_pytorch_1.12.1.sif` in the directory where the command was run. After the image has been pulled, the temporary directory `$SCRATCH/tmp-singularity` can be removed.
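As a quick sanity check of the image, one can print the PyTorch and ROCm (HIP) versions bundled in the container. This assumes the `.sif` file is in the current directory; on a login node without GPUs, `torch.cuda.is_available()` would report `False`, so only the version information is meaningful there:

```bash
# Print the PyTorch version and the HIP (ROCm) version it was built against.
singularity exec pytorch_rocm5.4.1_ubuntu20.04_py3.7_pytorch_1.12.1.sif \
    python -c 'import torch; print(torch.__version__, torch.version.hip)'
```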
## Installing other packages along the container's PyTorch installation
Often we may need to install other packages to be used along the container's PyTorch. That can be done by creating a virtual environment in a host directory, using the container's Python: run the container interactively and create the environment in your `$HOME`. As an example, let's do that to install the package `python-hostlist`:
```
$> singularity exec -B $SCRATCH:$SCRATCH pytorch_rocm5.4.1_ubuntu20.04_py3.7_pytorch_1.12.1.sif bash
Singularity> python -m venv pt_rocm5.4.1_env --system-site-packages
Singularity> . pt_rocm5.4.1_env/bin/activate
(pt_rocm5.4.1_env) Singularity> pip install python-hostlist
```
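Because the environment was created with `--system-site-packages`, it sees both the container's PyTorch and the newly installed package. A quick check from outside an interactive session could look like this (a sketch, assuming the environment was created in `$HOME` as above):

```bash
# Verify that python-hostlist is importable from the environment by expanding a
# hypothetical Slurm-style node list (used only for illustration).
singularity exec pytorch_rocm5.4.1_ubuntu20.04_py3.7_pytorch_1.12.1.sif bash -c '
    . ~/pt_rocm5.4.1_env/bin/activate
    python -c "import hostlist; print(hostlist.expand_hostlist(\"nid[001000-001003]\"))"
'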
## Multi-GPU training
The communication between LUMI's GPUs during training with PyTorch is done via RCCL, a library of collective communication routines for AMD GPUs. RCCL works out of the box on LUMI; however, a special plugin is required for it to take advantage of the Slingshot interconnect. That's the `aws-ofi-rccl` plugin, a library that RCCL can use as a network back-end to interact with the interconnect via libfabric.
The `aws-ofi-rccl` plugin can be installed by the user with EasyBuild:
```bash
module load LUMI/22.08 partition/G
module load EasyBuild-user
eb aws-ofi-rccl-66b3b31-cpeGNU-22.08.eb -r
```
Loading the resulting `aws-ofi-rccl` module adds the path to the library to `LD_LIBRARY_PATH` so RCCL can detect it.
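This can be confirmed after loading the module; the check below is a sketch, assuming the EasyBuild installation above succeeded (`EBROOTAWSMINOFIMINRCCL` is the installation root set by the module, as also used in the batch script further down):

```bash
module load LUMI/22.08 partition/G
module load aws-ofi-rccl

# The plugin's lib directory should now appear on the library search path.
echo "$LD_LIBRARY_PATH" | tr ':' '\n' | grep -i aws-ofi-rccl
ls "$EBROOTAWSMINOFIMINRCCL/lib"
```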
### Example
Let's now consider an example to test the steps above. We will use the script `cnn_distr.py`, which relies on the `pt_distr_env.py` module to set up PyTorch's distributed environment. That module is based on `python-hostlist`, which we installed earlier.
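Conceptually, the module derives the rendezvous information (master address and port, rank, world size) from Slurm's environment. The same information could also be exported directly in the batch script, roughly as below; this is a hypothetical sketch, not the actual contents of `pt_distr_env.py`:

```bash
# Use the first node of the allocation as the rendezvous point.
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500   # any free port

# Per task (available inside srun): SLURM_PROCID gives the global rank,
# SLURM_NPROCS the world size and SLURM_LOCALID the local rank; these are
# typically mapped to RANK, WORLD_SIZE and LOCAL_RANK for PyTorch's
# distributed initialization.
```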
The Slurm submission script can be something like this:
```bash
#!/bin/bash
#SBATCH --job-name=pt-cnn
#SBATCH --ntasks=32
#SBATCH --ntasks-per-node=8
#SBATCH --time=0:10:0
#SBATCH --exclusive
#SBATCH --partition standard-g
#SBATCH --account=<project>
#SBATCH --gpus-per-node=8

module load LUMI/22.08 partition/G
module load singularity-bindings
module load aws-ofi-rccl

. ~/pt_rocm5.4.1_env/bin/activate

export NCCL_SOCKET_IFNAME=hsn
export NCCL_NET_GDR_LEVEL=3
export MIOPEN_USER_DB_PATH=/tmp/${USER}-miopen-cache-${SLURM_JOB_ID}
export MIOPEN_CUSTOM_CACHE_DIR=${MIOPEN_USER_DB_PATH}
export CXI_FORK_SAFE=1
export CXI_FORK_SAFE_HP=1
export FI_CXI_DISABLE_CQ_HUGETLB=1
export SINGULARITYENV_LD_LIBRARY_PATH=/opt/ompi/lib:${EBROOTAWSMINOFIMINRCCL}/lib:/opt/cray/xpmem/2.4.4-2.3_9.1__gff0e1d9.shasta/lib64:${SINGULARITYENV_LD_LIBRARY_PATH}

srun singularity exec -B"/appl:/appl" \
    -B"$SCRATCH:$SCRATCH" \
    $SCRATCH/pytorch_rocm5.4.1_ubuntu20.04_py3.7_pytorch_1.12.1.sif python cnn_distr.py
```
The `NCCL_` and `CXI_` variables, as well as `FI_CXI_DISABLE_CQ_HUGETLB`, are used by RCCL for the communication over Slingshot. The `MIOPEN_` ones are needed to make MIOpen create its caches on `/tmp`. Finally, `SINGULARITYENV_LD_LIBRARY_PATH` adds some host directories to the container's `LD_LIBRARY_PATH`; this is important for RCCL to find the `aws-ofi-rccl` plugin. In addition, `NCCL_DEBUG=INFO` can be used to increase RCCL's logging level and confirm that the `aws-ofi-rccl` plugin is being used: lines indicating that the plugin was loaded and which libfabric provider (`cxi` on LUMI) was selected should then appear in the output.
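One way to perform that check is sketched below; `NCCL_DEBUG_SUBSYS` simply narrows the extra logging to the relevant subsystems, and the output file name assumes Slurm's default naming:

```bash
# Add to the batch script before the srun line:
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET   # optional: limit the extra logging

# After the job has finished, look for the plugin/provider messages:
grep -iE 'ofi|libfabric|provider' slurm-<jobid>.out
```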
After running the script above, the output printed by `cnn_distr.py` should appear in the job's Slurm output file.