Example of a minimal program using the NVHPC stack with CUDA-aware MPI
About NVHPC
NVHPC is an integrated collection of software tools and libraries distributed by NVIDIA. An overview document describing NVHPC can be found here. The aim of the NVHPC team is to provide up-to-date, preconfigured suites of compilers, libraries, and tools that are specifically optimized for NVIDIA GPU hardware. It supports single- and multi-GPU execution.
Basic Usage Example
This example shows the steps for using NVHPC to run a simple test MPI program, written in C, that communicates between two GPUs. The detailed steps, which can be executed in an interactive Slurm session, are explained below. Complete Slurm job script examples are shown at the end.
Prerequisites
This example assumes you have access to a Slurm partition with GPU resources and, if using Engaging, that you are working in a Rocky Linux environment.
1. Activate the relevant NVHPC module
The NVHPC environment is installed as a module and can be made visible in a session with the module load command.
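For example (the module names below are taken from the job scripts at the end of this page; the exact name and version available on your system may differ):
module load nvhpc/2023_233/nvhpc/23.3   # on Engaging
module load nvhpc/21.5                  # on Satori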
This will add a specific version of the NVHPC software (version 23.3, released in 2023, on Engaging; version 21.5, released in 2021, on Satori) to a shell or batch script. The software added includes compilers for C, C++, and Fortran; base GPU-optimized numerical libraries for linear algebra, Fourier transforms, and others; and GPU-optimized communication libraries supporting the MPI, SHMEM, and NCCL APIs.
An environment variable, NVHPC_ROOT, is also set. This can be used in scripts to reference the locations of libraries when needed.
2. Set paths needed for compile step
Here we use the module environment variable, NVHPC_ROOT, to set environment variables that hold the paths needed for compiling and linking the code.
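A minimal sketch of this step, using the same variable names that the compile step and the job scripts below use:
cuincdir=$NVHPC_ROOT/cuda/include
culibdir=$NVHPC_ROOT/cuda/lib64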
3. Create a C program that executes some simple multi-node, multi-GPU test code.
The next step is to create a file holding C code that uses MPI to send information between two GPUs running in different processes. Paste the C code below into a file called test.c.
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char *argv[])
{
  int myrank, mpi_nranks;
  int LBUF = 1000000;
  float *sBuf_h, *rBuf_h;   // Host buffers
  float *sBuf_d, *rBuf_d;   // Device (GPU) buffers
  int bSize;
  int i;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &myrank);       // my MPI rank
  MPI_Comm_size(MPI_COMM_WORLD, &mpi_nranks);
  if ( mpi_nranks != 2 ) { printf("Program requires exactly 2 ranks\n"); exit(-1); }

  int deviceCount;
  cudaGetDeviceCount(&deviceCount);             // How many GPUs?
  printf("Number of GPUs found = %d\n", deviceCount);
  int device_id = myrank % deviceCount;
  cudaSetDevice(device_id);                     // Map MPI process to a GPU
  printf("Assigned GPU %d to MPI rank %d of %d.\n", device_id, myrank, mpi_nranks);

  // Allocate buffers on each host and device
  bSize = sizeof(float)*LBUF;
  sBuf_h = malloc(bSize);
  rBuf_h = malloc(bSize);
  for (i = 0; i < LBUF; ++i) {
    sBuf_h[i] = (float)myrank;
    rBuf_h[i] = -1.;
  }
  if ( myrank == 0 ) {
    printf("rBuf_h[0] = %f\n", rBuf_h[0]);
  }
  cudaMalloc((void **)&sBuf_d, bSize);
  cudaMalloc((void **)&rBuf_d, bSize);
  cudaMemcpy(sBuf_d, sBuf_h, bSize, cudaMemcpyHostToDevice);

  // CUDA-aware MPI: device pointers are passed directly to MPI_Send/MPI_Recv
  if ( myrank == 0 ) {
    MPI_Recv(rBuf_d, LBUF, MPI_FLOAT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  }
  else if ( myrank == 1 ) {
    MPI_Send(sBuf_d, LBUF, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
  }
  else {
    printf("Unexpected myrank value %d\n", myrank);
    exit(-1);
  }
  cudaMemcpy(rBuf_h, rBuf_d, bSize, cudaMemcpyDeviceToHost);
  if ( myrank == 0 ) {
    printf("rBuf_h[0] = %f\n", rBuf_h[0]);
  }

  MPI_Barrier(MPI_COMM_WORLD);
  MPI_Finalize();
  return 0;
}
4. Compile program
Here we use the NVHPC MPI compiler wrapper, mpicc, to compile the program. The two environment variables we set earlier (cuincdir and culibdir) tell the compile step where to find the relevant CUDA header and library files. The CUDA runtime library (cudart) is added on the link line so that the CUDA functions the code uses can be resolved.
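The compile command, as used in the job scripts at the end of this page, is:
mpicc test.c -I${cuincdir} -L${culibdir} -lcudart
This produces an executable named a.out in the current directory.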
5. Execute program
Once the code has been compiled, the mpiexec command that is part of the nvhpc module can be used to run the test program. The nvhpc module defaults to using its built-in version of OpenMPI. The OpenMPI option btl_openib_warn_no_device_params_found is passed to the OpenMPI runtime library; it suppresses a warning that OpenMPI can generate when it encounters a network device card that is not present in a built-in list that OpenMPI has historically included.
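The run command, again taken from the job scripts at the end of this page, is:
mpiexec --mca btl_openib_warn_no_device_params_found 0 -n 2 ./a.out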
Running this program using the command above should produce the following output.
Number of GPUs found = 1
Number of GPUs found = 1
Assigned GPU 0 to MPI rank 0 of 2.
rBuf_h[0] = -1.000000
Assigned GPU 0 to MPI rank 1 of 2.
rBuf_h[0] = 1.000000
Example Slurm job files for executing this example
First create a file called "test.c" containing the example C program above. The job scripts below run all of the steps described above for "test.c". The first script is for Engaging (using the rocky8 constraint and the nvhpc/23.3 module); the second is for Satori (using the nvhpc/21.5 module). A job script can be submitted to Slurm using the command sbatch followed by the filename holding the job script.
#!/bin/bash
#SBATCH -p sched_system_all
#SBATCH --constraint=rocky8
#SBATCH -N 2
#SBATCH -n 2
#SBATCH --gres=gpu:2
#SBATCH --time=00:02:00
#
# Basic slurm job that tests GPU aware MPI in the NVHPC tool stack.
#
#
# To submit through Slurm use:
#
# $ sbatch test_cuda_and_mpi.sbatch
#
# in terminal.
# Write a little log info
echo "## Start time \""`date`"\""
echo "## Slurm job running on nodes \"${SLURM_JOB_NODELIST}\""
echo "## Slurm submit directory \"${SLURM_SUBMIT_DIR}\""
echo "## Slurm submit host \"${SLURM_SUBMIT_HOST}\""
echo " "
module load nvhpc/2023_233/nvhpc/23.3
culibdir=$NVHPC_ROOT/cuda/lib64
cuincdir=$NVHPC_ROOT/cuda/include
mpicc test.c -I${cuincdir} -L${culibdir} -lcudart
mpiexec --mca btl_openib_warn_no_device_params_found 0 -n 2 ./a.out
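The equivalent job script for Satori, which loads the nvhpc/21.5 module, is shown below.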
#!/bin/bash
#SBATCH -n 2
#SBATCH --gres=gpu:2
#SBATCH --time=00:02:00
#
# Basic slurm job that tests GPU aware MPI in the NVHPC tool stack.
#
#
# To submit through Slurm use:
#
# $ sbatch test_cuda_and_mpi.sbatch
#
# in terminal.
# Write a little log info
echo "## Start time \""`date`"\""
echo "## Slurm job running on nodes \"${SLURM_JOB_NODELIST}\""
echo "## Slurm submit directory \"${SLURM_SUBMIT_DIR}\""
echo "## Slurm submit host \"${SLURM_SUBMIT_HOST}\""
echo " "
module load nvhpc/21.5
culibdir=$NVHPC_ROOT/cuda/lib64
cuincdir=$NVHPC_ROOT/cuda/include
mpicc test.c -I${cuincdir} -L${culibdir} -lcudart
mpiexec --mca btl_openib_warn_no_device_params_found 0 -n 2 ./a.out