
TensorFlow on LUMI

TensorFlow is an end-to-end open source platform for machine learning. It has a comprehensive, flexible ecosystem of tools, libraries, and community resources that lets researchers push the state-of-the-art in ML and developers easily build and deploy ML-powered applications.

TensorFlow can be installed by following the official instructions for installing a ROCm compatible TensorFlow via pip. Please consult the Python packages installation guide for an overview of recommended ways to manage pip installations on LUMI. Alternatively, container images made specifically for running TensorFlow on LUMI may be used.
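For example, in a suitable Python environment the ROCm build of TensorFlow used by the containers below can be installed roughly like this (a sketch only; follow the official instructions for the wheel matching your ROCm version):

pip install tensorflow-rocm==2.11.1.550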

Running TensorFlow within containers provided by LUMI

The LUMI software library includes EasyBuild recipes for TensorFlow containers. These containers are developed by AMD specifically for LUMI and contain the parts needed to run TensorFlow and Horovod on LUMI, including the plugin needed by RCCL for distributed AI and a ROCm version suitable for that TensorFlow version. The container images are also available on LUMI at /appl/local/containers/sif-images/, and the definition files and more can be found in this GitHub repository.

The container uses a miniconda environment in which Python and its packages are installed. This environment must be activated inside the container at run time, which is done with the command stored in the container's environment variable $WITH_CONDA (for this container, source /opt/miniconda3/bin/activate tensorflow).
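For example, the environment can be activated on the fly when executing a command in the container (a minimal illustration using the same image as in the examples below):

singularity exec /appl/local/containers/sif-images/lumi-tensorflow-rocm-5.5.1-python-3.10-tensorflow-2.11.1-horovod-0.28.1.sif \
    bash -c '$WITH_CONDA && python -c "import tensorflow as tf; print(tf.__version__)"'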

Information about extending these containers can be found in the Singularity/Apptainer section. Below is a base conda environment file (the env.yml used in the cotainr command further down) for running the native TensorFlow example; it can be extended to your needs.

name: py310_rocm551_tf211
channels:
  - conda-forge
dependencies:
  - python=3.10
  - protobuf=3.19.6
  - pip=23.3.1
  - pip:
    - tensorflow-rocm==2.11.1.550

In short, the environment file (saved as env.yml) can be used like this with cotainr to create a new container with the described conda environment.

module load LUMI/23.03 partition/C
module load cotainr/2023.11.0-cray-python-3.9.13.1
cotainr build lumi-tensorflow-base.sif --base-image=/appl/local/containers/sif-images/lumi-tensorflow-rocm-5.5.1-python-3.10-tensorflow-2.11.1-horovod-0.28.1.sif --conda-env=env.yml

Multi-GPU training

There are a few ways to do Multi-GPU training with TensorFlow. One of the most common distribution methods for TensorFlow is Horovod, which is included in the LUMI containers mentioned above. TensorFlow also provides native distribution methods through tf.distribute.MultiWorkerMirroredStrategy. It implements synchronous distributed training across multiple workers, each with potentially multiple GPUs.
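For orientation, a minimal sketch of a training script using MultiWorkerMirroredStrategy together with the Slurm cluster resolver could look as follows. This is not the downloadable tf2_distr.py test script (which, for example, accepts a --batch-size argument), just an illustration of the pattern:

import tensorflow as tf

# Resolve the cluster layout from the Slurm environment (one task per GPU).
resolver = tf.distribute.cluster_resolver.SlurmClusterResolver()

# Use NCCL (RCCL on AMD GPUs) for the collective communication.
communication = tf.distribute.experimental.CommunicationOptions(
    implementation=tf.distribute.experimental.CommunicationImplementation.NCCL)

strategy = tf.distribute.MultiWorkerMirroredStrategy(
    cluster_resolver=resolver, communication_options=communication)

# Synthetic data keeps the sketch self-contained.
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal([1024, 32]),
     tf.random.uniform([1024], maxval=10, dtype=tf.int64))).batch(256)

# Variables created inside the scope are mirrored across all workers.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(10)])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"])

model.fit(dataset, epochs=3)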

The communication between LUMI's GPUs during training with TensorFlow is done via RCCL, a library of collective communication routines for AMD GPUs. RCCL works out of the box on LUMI; however, a special plugin is required so it can take advantage of the Slingshot 11 interconnect. That is the aws-ofi-rccl plugin, a library RCCL can use as a back-end to interact with the interconnect via libfabric. When using the containers provided by LUMI, this plugin is built into the container.

Examples

You can execute the test case by downloading the MultiWorkerMirroredStrategy or Horovod test script.
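For comparison, a Horovod variant of the same kind of training loop could look roughly like this (again only a sketch, not the downloadable test script). Each rank pins itself to a single GPU via hvd.local_rank(), which is why CUDA_VISIBLE_DEVICES does not need to be set when using the Horovod code:

import tensorflow as tf
import horovod.tensorflow.keras as hvd

# Initialize Horovod and pin this rank to one GPU.
hvd.init()
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank() % len(gpus)], "GPU")

dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal([1024, 32]),
     tf.random.uniform([1024], maxval=10, dtype=tf.int64))).batch(256)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(32,)),
    tf.keras.layers.Dense(10)])

# Wrap the optimizer so gradients are averaged across all ranks (via RCCL).
# The legacy optimizer is used for compatibility with hvd.DistributedOptimizer.
optimizer = hvd.DistributedOptimizer(
    tf.keras.optimizers.legacy.Adam(0.001 * hvd.size()))

model.compile(
    optimizer=optimizer,
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"])

# Broadcast the initial weights from rank 0 so all workers start identically.
model.fit(dataset, epochs=3,
          callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],
          verbose=1 if hvd.rank() == 0 else 0)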

We will use a bash script to set the environment variables and then call the TensorFlow code saved as tf2_distr.py.

run.sh
#!/bin/bash -e
cd /workdir

# The line below should be removed if you built a new container with cotainr
$WITH_CONDA
set -x
echo $SLURM_LOCALID

# export NCCL_DEBUG=INFO
export NCCL_SOCKET_IFNAME=hsn0,hsn1,hsn2,hsn3
export NCCL_NET_GDR_LEVEL=3
export MIOPEN_USER_DB_PATH="/tmp/$(whoami)-miopen-cache-$SLURM_NODEID"
export MIOPEN_CUSTOM_CACHE_DIR=$MIOPEN_USER_DB_PATH
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

export TF_CPP_MAX_VLOG_LEVEL=-1

rocm-smi
# Set MIOpen cache out of the home folder.
if [ $SLURM_LOCALID -eq 0 ] ; then
  rm -rf $MIOPEN_USER_DB_PATH
  mkdir -p $MIOPEN_USER_DB_PATH
fi
sleep 3

# Report affinity
echo "Rank $SLURM_PROCID --> $(taskset -p $$)"

python tf2_distr.py --batch-size=256

We have used a few environment variables in the run.sh script. The ones starting with NCCL_ are used by RCCL for the communication over Slingshot. The MIOPEN_ ones are needed to make MIOpen create its caches on /tmp. The NCCL_NET_GDR_LEVEL variable allows the user to finely control when to use GPU Direct RDMA between a NIC and a GPU. We have found 3 to be a good value, but it is worth experimenting with. Setting CUDA_VISIBLE_DEVICES is not necessary if using the Horovod code.

Additional TensorFlow debug messages can be seen if you remove TF_CPP_MAX_VLOG_LEVEL=-1. In addition, NCCL_DEBUG=INFO can be used to increase RCCL's logging level to make sure that the aws-ofi-rccl plugin is being used. To verify, you should see the following lines somewhere in the output:

NCCL INFO NET/OFI Using aws-ofi-rccl 1.4.0
NCCL INFO NET/OFI Selected Provider is cxi

Below, we bind-mount some components needed by the container and set the proper NUMA node to GPU affinity with a CPU binding mask.

If the job hangs with MultiWorkerMirroredStrategy, you might need to bind-mount a newer version of the TensorFlow Slurm cluster resolver (slurm_cluster_resolver.py), as is done in the batch script below.

#!/bin/bash
#SBATCH -p standard-g
#SBATCH -N 2
#SBATCH -n 16
#SBATCH --ntasks-per-node 8
#SBATCH --gpus-per-task 1
#SBATCH --threads-per-core 1
#SBATCH --exclusive
#SBATCH --gpus 16
#SBATCH --mem 0 
#SBATCH -t 0:15:00
#SBATCH --account=project_<your_project_id>

wd=$(pwd)
SIF=/appl/local/containers/sif-images/lumi-tensorflow-rocm-5.5.1-python-3.10-tensorflow-2.11.1-horovod-0.28.1.sif


c=fe
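# 0xfe = 0b11111110: each mask below selects 7 of the 8 cores of one CCD
# (L3 group), leaving the first core of the group free; the order of the
# masks maps local ranks 0-7 to the CCDs closest to GCDs 0-7 on a LUMI-G node.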
MYMASKS="0x${c}000000000000,0x${c}00000000000000,0x${c}0000,0x${c}000000,0x${c},0x${c}00,0x${c}00000000,0x${c}0000000000"
srun --cpu-bind=mask_cpu:$MYMASKS \
  singularity exec \
    -B /var/spool/slurmd:/var/spool/slurmd \
    -B /opt/cray:/opt/cray \
    -B /usr/lib64/libcxi.so.1:/usr/lib64/libcxi.so.1 \
    -B /usr/lib64/libjansson.so.4 \
    -B slurm_cluster_resolver.py:/opt/miniconda3/envs/tensorflow/lib/python3.10/site-packages/tensorflow/python/distribute/cluster_resolver/slurm_cluster_resolver.py \
    -B $wd:/workdir \
    $SIF /workdir/run.sh
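
The job can then be submitted with sbatch, assuming the batch script above has been saved as, for example, tf_multiworker.sh (a placeholder name):

sbatch tf_multiworker.sh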