
AI Software Environment

The AI Software Environment by LUMI AI Factory is a comprehensive, ready-to-use containerised stack for AI and machine learning workloads on the LUMI supercomputer. The environment is designed to address the complexity of deploying and maintaining AI/ML software in a high-performance computing (HPC) setting.

All build artifacts are publicly available. This includes the full recipe, Containerfiles, build logs, and the resulting final container images. This transparent approach enables full customization for special use cases, reuse on similar systems, and adaptation of the images for cloud environments.

Available container images

At the moment, each release includes the following container images, each building on the previous one by adding major new functionality:

  1. lumi-multitorch-rocm-*: Starts from an Ubuntu base image and adds ROCm
  2. lumi-multitorch-libfabric-*: Adds libfabric to the ROCm image
  3. lumi-multitorch-mpich-*: Adds MPICH with GPU support to the libfabric image
  4. lumi-multitorch-torch-*: Adds PyTorch to the MPICH image
  5. lumi-multitorch-full-*: Adds a selection of AI and ML libraries (e.g., Bitsandbytes, DeepSpeed, Flash Attention, Megatron LM, vLLM) to the PyTorch image

The releases on GitHub also include full details of the software included in each image.

Each container name includes a timestamp and a version identifier; both are explained in the releases on GitHub.
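As an illustrative sketch (the authoritative naming scheme is the one documented in the releases), the file name of the full image used in the examples below can be split into an image variant, a version identifier, and a build timestamp:

```shell
# Split a container file name into its parts (illustrative only; the
# meaning of the version string is described in the GitHub releases).
SIF_NAME=lumi-multitorch-full-u24r64f21m43t29-20260124_092648.sif
BASE=${SIF_NAME%.sif}                      # drop the .sif suffix
VARIANT=$(echo "$BASE" | cut -d- -f3)      # image variant, e.g. full
VERSION=$(echo "$BASE" | cut -d- -f4)      # version identifier
TIMESTAMP=$(echo "$BASE" | cut -d- -f5)    # build timestamp
echo "$VARIANT $VERSION $TIMESTAMP"
```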

For users running AI applications based on PyTorch, the containers starting with lumi-multitorch-full-* are most likely the best starting point. Advanced users can build on the intermediate containers to customize the environment for their use cases.

Access to container images

The container images are available in the /appl/local/laifs/containers directory on LUMI and from a public GitHub repository. The GitHub releases include full details of the provided container images.

Examples for using the container images

This list only includes some examples for using the container images. More examples can be found in the LUMI AI guide.

lumi-aif-singularity-bindings module

To give LUMI containers access to the file system of the working directory, some additional bindings are required. As it can be quite cumbersome to set these bindings manually, we provide a module that does this for you. You can load the module with the following commands:

module purge
module use /appl/local/laifs/modules
module load lumi-aif-singularity-bindings

If you prefer to set the bindings manually, we recommend taking a look at the Running containers on LUMI lecture from the LUMI AI workshop material.
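If you do set the bindings yourself, the call has roughly the shape below. The path list here is an assumption based on LUMI's commonly documented mount points, not an exact reproduction of what the module binds; check the lecture material for the authoritative list:

```shell
# Sketch only: bind a comma-separated list of host paths into the container.
# The path list is an assumption, not verified against the module.
BINDS=/pfs,/scratch,/projappl,/project,/flash,/appl
echo "singularity run -B $BINDS \$SIF ..."
```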

Run PyTorch using the container

module purge
module use /appl/local/laifs/modules
module load lumi-aif-singularity-bindings
export SIF=/appl/local/laifs/containers/lumi-multitorch-u24r64f21m43t29-20260124_092648/lumi-multitorch-full-u24r64f21m43t29-20260124_092648.sif
srun -A <your-project-id> -p small-g -n 1 --gpus-per-task=1 singularity run $SIF python -c "import torch; print(torch.cuda.device_count())"

List pip packages in container

To inspect which specific packages are included in the images, you can use this simple command:

export SIF=/appl/local/laifs/containers/lumi-multitorch-u24r64f21m43t29-20260124_092648/lumi-multitorch-full-u24r64f21m43t29-20260124_092648.sif
singularity run $SIF pip list

Alternatively, you can have a look at the software bill of materials (SBOM) .json file in the releases on GitHub or in the directory of the container on LUMI.
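If you have the SBOM file at hand, package names can be pulled out with standard tools. This sketch assumes a CycloneDX-style .json layout with a components array; the actual field names in the released SBOMs may differ:

```shell
# Create a tiny stand-in SBOM to demonstrate the extraction (assumed layout).
cat > sbom-sample.json <<'EOF'
{"components": [{"name": "torch", "version": "2.9.0"},
                {"name": "h5py", "version": "3.11.0"}]}
EOF
# List the component names; with the real file, replace sbom-sample.json.
NAMES=$(grep -o '"name": "[^"]*"' sbom-sample.json | cut -d'"' -f4 | paste -sd' ' -)
echo "$NAMES"
```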

Add more pip packages to container

You might find yourself in a situation where none of the provided containers contains all the Python packages you need. There are multiple ways of adding more pip packages; a lightweight option is a virtual environment on top of the container.

For this example, we add the HDF5 Python package h5py to the environment:

module purge
module use /appl/local/laifs/modules
module load lumi-aif-singularity-bindings
export SIF=/appl/local/laifs/containers/lumi-multitorch-u24r64f21m43t29-20260124_092648/lumi-multitorch-full-u24r64f21m43t29-20260124_092648.sif
singularity shell $SIF
Singularity> python -m venv h5-env --system-site-packages
Singularity> source h5-env/bin/activate
(h5-env) Singularity> pip install h5py

This will create an h5-env environment in the working directory. The --system-site-packages flag gives the virtual environment access to the packages from the container.

Strain on Lustre file system

Installing Python packages typically creates thousands of small files. This puts a lot of strain on the Lustre file system and might exceed your file quota. You can avoid this by building a new container based on the provided images instead.

You can now execute a script that imports the h5py package. To execute a script called my-script.py within the container using the virtual environment, add the activation command:

export SIF=/appl/local/laifs/containers/lumi-multitorch-u24r64f21m43t29-20260124_092648/lumi-multitorch-full-u24r64f21m43t29-20260124_092648.sif
singularity run $SIF bash -c 'source h5-env/bin/activate && python my-script.py'
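If you run scripts through the virtual environment often, the activation can be wrapped once. This is only a convenience sketch; the in-venv.sh name and the h5-env path come from the example above, not from the containers themselves:

```shell
# Write a small wrapper that activates the venv and then runs any command.
cat > in-venv.sh <<'EOF'
#!/bin/bash
source h5-env/bin/activate
exec "$@"
EOF
chmod +x in-venv.sh
echo "wrapper executable: $(test -x in-venv.sh && echo yes)"
# Usage on LUMI (sketch): singularity run $SIF ./in-venv.sh python my-script.py
```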

Build new containers based on the images

It is possible to create new containers based on the existing containers. In general, you can follow the instructions for extending Singularity images.

GPU support and communication libraries

Note that for some packages, installing with GPU support or against the correct communication libraries can require significant work. To reduce this risk, we recommend starting from an image that already contains most of the required software.

To build a new container the following steps are required:

  1. Check what software you want to install.

    Note: For a few small pip packages a virtual environment might be sufficient.

  2. Identify the correct base container from the releases on GitHub.

    Examples:

    • For installing an alternative deep learning framework like JAX, start with the lumi-multitorch-mpich-* image

    • For installing a system package while requiring PyTorch and vLLM, start with the lumi-multitorch-full-* image.

  3. Create a Singularity definition file.

    Examples:

    • Add pip package like scikit-learn to lumi-multitorch-torch-*:

      Save the following as scikit.def.

      Bootstrap: docker
      From: docker.io/lumiaifactory/lumi-multitorch:torch
      
      %post
          . /opt/venv/bin/activate
          pip install scikit-learn
      

    • Add some system package like nvtop to lumi-multitorch-full-*:

      Save the following as nvtop.def.

      Bootstrap: docker
      From: docker.io/lumiaifactory/lumi-multitorch:full
      
      %post
          apt-get update && apt-get install -y nvtop
      

  4. Build the container following the instructions for extending Singularity images. This is possible either on your own hardware or directly on LUMI using PRoot.

    Memory requirements

    Creating new containers based on the provided containers might require significant memory. One option is to use an interactive Slurm job with sufficient memory.

    Example:

    • singularity build scikit.sif scikit.def
    • singularity build nvtop.sif nvtop.def
  5. Check that the package has been installed correctly. How to do this depends on the specific package, but for packages using GPUs it is worth checking that the GPU is detected, and for packages using multiple nodes it is recommended to verify that all nodes are actually used by your package.
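As a concrete sketch of step 5 for the scikit-learn example above (the check script name and contents are illustrative, not part of the images):

```shell
# Write a check script; run it later inside the newly built container.
cat > check-install.sh <<'EOF'
#!/bin/bash
python - <<'PY'
import sklearn, torch
print("scikit-learn:", sklearn.__version__)
print("GPUs visible:", torch.cuda.device_count())
PY
EOF
chmod +x check-install.sh
echo "check script ready"
# On LUMI, e.g. (sketch):
#   srun -A <your-project-id> -p small-g -n 1 --gpus-per-task=1 \
#       singularity run scikit.sif ./check-install.sh
```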

More information
