Sharing GPU Nodes with Torque

On most of our clusters we have some nodes that contain GPUs. In the case of this particular user request, the cluster was running Torque with Moab. The symptom was that about half the time a job would fail with the error:

Error using gpuDevice (line 26)
An unexpected error occurred during CUDA execution. The CUDA error was:
all CUDA-capable devices are busy or unavailable

Error in PROGRAM (line 71)
g = gpuDevice(1);

The program is Matlab which, following its Fortran heritage and unlike many other languages, uses one-based arrays. So the line is assigning the first GPU device to the variable g. The jobs were being submitted to the correct queue, whose nodes have two GPUs each, and since each job needed only one GPU it was correctly requesting that much resource in the submission script:

#!/bin/bash
#PBS -q gpu
#PBS -l nodes=1:ppn=1:gpus=1
#PBS -l walltime=30:00:00
#PBS -N JOBNAME

cd $PBS_O_WORKDIR

matlab -r PROGRAM

The problem is that occasionally another job (belonging to another user, or even another instance of this user's single-GPU jobs) would already be running on the node. The new job would start, Matlab would see both GPUs, and it would attempt to access the first one - even if that wasn't the GPU assigned to the job.
It is possible to set an environment variable that "masks" out the GPUs that should not be seen. The variable is CUDA_VISIBLE_DEVICES. Unfortunately the version of Torque we were using does not set that variable. It does, however, write a file that lists the nodes and GPUs assigned to a job, and it sets an environment variable, PBS_GPUFILE, which points at that file.
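That file lists one line per GPU assigned to the job, in a hostname-gpuN form. For a job given the second GPU on a node it would contain something like this (the hostname and index here are purely illustrative):

node0123-gpu1
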
So the fix was to add the following to the job submission script above the line where matlab is called:

export CUDA_VISIBLE_DEVICES=$(grep ${HOSTNAME} ${PBS_GPUFILE} | awk -F"-gpu" '{printf A$2;A=","}')
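
Taken apart: grep keeps only the lines for the node the script is running on, awk splits each of those lines on the literal string "-gpu" so that $2 is the GPU index, and the A variable prepends a comma to every index after the first. A quick way to convince yourself is to run it against a throwaway file (the path and contents here are made up for the test):

printf 'node0123-gpu0\nnode0123-gpu1\n' > /tmp/gpufile
grep node0123 /tmp/gpufile | awk -F"-gpu" '{printf A$2;A=","}'
# prints 0,1 - a job assigned both GPUs on the node would see both devices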

An even better fix is to put this in a system prologue script so that it is fixed for everyone, or to update to a newer version of Torque that does set the variable.
