Computing Cluster (SLURM)

The computing cluster

At CEREMADE, we have a cluster for parallel computing.

Description of the structure

[Figure: structure of the cluster]

The nodes

The cluster consists of 8 nodes (machines named clust1, clust2, etc.) of different configurations:

  • clust1: 40 CPU(s), Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz, 1 Tesla T4 GPU
  • clust2: 40 CPU(s), Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz, 1 Tesla T4 GPU
  • clust3: 40 CPU(s), Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz, 1 Tesla P4 GPU
  • clust4: 40 CPU(s), Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz, 1 Tesla P4 GPU
  • clust5: 40 CPU(s), Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz, 1 Tesla P4 GPU
  • clust6: 40 CPU(s), Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz, 1 Tesla P4 GPU
  • clust7: 40 CPU(s), Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz, 1 Tesla P4 GPU
  • clust8: 40 CPU(s), Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz, 1 Tesla P4 GPU

So a total of 320 CPUs!

For the ERC MDFT

  • clust9: 40 CPU(s), Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
  • clust10: 40 CPU(s), Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz

SLURM and the cluster.ceremade.dauphine.lan machine

To manage submitted computations, the SLURM service has been set up. A dedicated front-end machine named cluster.ceremade.dauphine.lan was configured: you submit the desired calculation through it, requesting the time and resources that SLURM will then manage.

Copy files to the cluster

For example, we can send our pi directory containing code, data, etc. via scp:

scp -r /home/chupin/pi/ chupin@cluster.ceremade.dauphine.lan:~/

It is also possible to do this via SFTP (using, for example, FileZilla).

Connect to the cluster machine

We connect to the cluster.ceremade.dauphine.lan machine with ssh:

If you access the cluster through the VPN (https://www.ceremade.dauphine.fr/doc/fr/logiciels/vpn-dauphine), you must use the IP address 10.101.7.5 rather than the DNS name cluster.ceremade.dauphine.lan.

ssh username@cluster.ceremade.dauphine.lan

or

ssh username@10.101.7.5

Send a calculation

To send a calculation, you must write an SBATCH file that tells SLURM what resources are needed and which commands to run.

SBATCH scripts are bash scripts in which the comments starting with #SBATCH are directives for SLURM.

Packages and other libraries for interpreted languages

If you use Python, Julia, or R and your code relies on particular libraries, they must be installed locally in your home directory (with pip, Pkg, etc.), as sketched below.
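
For example, a per-user installation in your home might look like this (the package names are only illustrative):

# Python: install into ~/.local (use pip or pip3 depending on the setup)
pip3 install --user numpy

# Julia: packages are installed under ~/.julia by default
julia -e 'using Pkg; Pkg.add("Distributions")'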

SBATCH script construction

Minimal example of SBATCH commands

A minimal example of SBATCH directives is provided here. All the options are described at the end of the page.

#!/bin/sh
# File submission.SBATCH
#SBATCH --nodes=1
#SBATCH -c 20
#SBATCH --job-name="MY_JOB"
#SBATCH --output=test.out
#SBATCH --mail-user=chupin@ceremade.dauphine.fr
#SBATCH --mail-type=BEGIN,END,FAIL

Some SBATCH environment variables

The following SBATCH script:

#!/bin/sh
# File submission.SBATCH
#SBATCH --nodes=1
#SBATCH -c 20
#SBATCH --job-name="MY_JOB"
#SBATCH --output=%x.%J.out
#SBATCH --error=%x.%J.out
#SBATCH --mail-user=duleu@ceremade.dauphine.fr
#SBATCH --mail-type=BEGIN,FAIL,END

### Some info that may be useful
echo Host `hostname`

### Total number of CPUs
echo It has been allocated $SLURM_JOB_CPUS_PER_NODE cpus

### Definition of the env variable for OpenMP
# $SLURM_JOB_CPUS_PER_NODE is the number of CPUs requested per node
OMP_NUM_THREADS=$SLURM_JOB_CPUS_PER_NODE
export OMP_NUM_THREADS
echo This job has $OMP_NUM_THREADS cpus

will produce the following result:

Host clust3
It has been allocated 20 cpus
This job has 20 cpus

Running the calculation program

To run a job from cluster.ceremade.dauphine.lan, use:

chupin@cluster:~/pi/> sbatch submission.SBATCH

Other SLURM tools to view, cancel, stop, etc. a job are described at the end of the page.

Complete examples

Example 1: example using OpenMP

Let's consider, for example, a C++ code compute_pi.cpp that uses the omp.h library and therefore OpenMP compiler directives (a Python code would also fit this scheme).

Such a code must be compiled in the following way:

g++ -o compute_pi -fopenmp compute_pi.cpp 

Once this is done, we create an SBATCH script (in a file named, for this example, submission.SBATCH), which can look like the sketch below.
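
A plausible version of this script, reusing the directives already shown on this page (the job name, the requested resources, and the mail address are placeholders to adapt):

#!/bin/sh
# File submission.SBATCH
#SBATCH --nodes=1
#SBATCH -c 20
#SBATCH --job-name="PI_OPENMP"
#SBATCH --output=%x.%J.out
#SBATCH --error=%x.%J.out
#SBATCH --mail-user=<user>@ceremade.dauphine.fr
#SBATCH --mail-type=BEGIN,END,FAIL

# Pass the number of allocated CPUs to OpenMP
OMP_NUM_THREADS=$SLURM_JOB_CPUS_PER_NODE
export OMP_NUM_THREADS

# run the compiled OpenMP program
./compute_pi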

We then submit the job from cluster.ceremade.dauphine.lan:

chupin@cluster:~/pi/> sbatch submission.SBATCH

Example 2: example using Python (jupyter)

If you are not on the Dauphine premises, connected by Ethernet cable, you must first connect to the VPN.

To use Jupyter, we need to go through an interactive session, started with this command:

srun --pty -c 10 -N 1 /bin/bash

If all goes well you should see your prompt change from:

duleu@cluster:~/code/test$

to

duleu@clust3:~/code/test$

You can see that the machine name is now clust3 and not cluster. We can now run a jupyter notebook:

jupyter notebook --ip=0.0.0.0

We get the address to copy and paste in a browser:

[I 11:09:58.871 NotebookApp] JupyterLab extension loaded from /home/users/duleu/anaconda3/lib/python3.7/site-packages/jupyterlab
[I 11:09:58.871 NotebookApp] JupyterLab application directory is /home/users/duleu/anaconda3/share/jupyter/lab
[I 11:09:58.876 NotebookApp] Serving notebooks from local directory: /mnt/nfs/rdata02-users/users/duleu/code/test
[I 11:09:58.876 NotebookApp] The Jupyter Notebook is running at:
[I 11:09:58.876 NotebookApp] http://clust3:8888/?token=673591c461f08d5773353be62416ba33a3468551a8926287
[I 11:09:58.876 NotebookApp] or http://127.0.0.1:8888/?token=673591c461f08d5773353be62416ba33a3468551a8926287
[I 11:09:58.876 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[W 11:09:58.939 NotebookApp] No web browser found: could not locate runnable browser.
[C 11:09:58.939 NotebookApp] 

    To access the notebook, open this file in a browser:
        file:///mnt/nfs/rdata02-users/users/duleu/.local/share/jupyter/runtime/nbserver-1947447-open.html
    Or copy and paste one of these URLs:
        http://clust3:8888/?token=673591c461f08d5773353be62416ba33a3468551a8926287
     or http://127.0.0.1:8888/?token=673591c461f08d5773353be62416ba33a3468551a8926287

To open the notebook from your own machine, you must add .ceremade.dauphine.lan after clust3 in the URL.

Example 3: Python example

We want to run our Python program script.py. To do this, we can use the SBATCH file below (note the export of the OpenMP variable).

#!/bin/sh
# file submission.SBATCH
#SBATCH --nodes=1
#SBATCH -c 20
#SBATCH --job-name="Test_Python"
#SBATCH --output=%x.%J.out
#SBATCH --time=10:00
#SBATCH --error=%x.%J.out
#SBATCH --mail-user=<user>@ceremade.dauphine.fr
#SBATCH --mail-type=BEGIN,FAIL,END

### Definition of the env variable for OpenMP
# $SLURM_JOB_CPUS_PER_NODE is the number of CPUs requested per node
OMP_NUM_THREADS=$SLURM_JOB_CPUS_PER_NODE
export OMP_NUM_THREADS

python3 script.py
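
As with the other examples, the job is then submitted from the directory containing script.py and submission.SBATCH:

sbatch submission.SBATCH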

Example 4: Example using Matlab

Matlab is installed on all the nodes of the cluster. So we can use it. Let's suppose that in our working directory, we have a script script.m that we want to run.

#!/bin/sh
# File submission.SBATCH
#SBATCH --nodes=1
#SBATCH -c 20
#SBATCH --job-name="MY_JOB"
#SBATCH --output=%x.%J.out
#SBATCH --error=%x.%J.out
#SBATCH --mail-user=<name>@ceremade.dauphine.fr
#SBATCH --mail-type=BEGIN,END,FAIL

# we execute the matlab program but without graphical interface
matlab -nodisplay -nodesktop -r "run('script.m')"

This is a bash script whose comments starting with #SBATCH are directives for SLURM. Here, the name of the job is MY_JOB.

Warning: here, we have requested 20 CPUs on 1 node (it is SLURM that manages the choice of machines and CPUs). To actually use the 40 threads available on a node, Matlab would have to do multithreading, and we do not know how to set that up without the graphical interface.

Warning: on some accounts, the matlab command is not found, and you have to specify the full path of the executable:

/usr/local/bin/matlab -nodisplay -nodesktop -r "run('script.m')"

Once these files are on the cluster machine, in a directory of your home, we submit the job using the following command:

chupin@cluster:~/codematlab/> sbatch submission.SBATCH

Example 5 : example using a GPU (graphics card)

To request GPU resources, it is necessary to add this information in the SBATCH file or during an interactive session.

Here is an example of a file with a GPU resource request:

#!/bin/bash
# File submission.SBATCH
#SBATCH --nodes=1
#SBATCH -c 20
#SBATCH --gres=gpu:1
#SBATCH --job-name="MY_JOB"
#SBATCH --output=%x.%J.out
#SBATCH --error=%x.%J.out
#SBATCH --mail-user=<name>@ceremade.dauphine.fr
#SBATCH --mail-type=BEGIN,END,FAIL

# For OpenMP export 
OMP_NUM_THREADS=$SLURM_JOB_CPUS_PER_NODE
export OMP_NUM_THREADS

# we move to the submission directory
cd $SLURM_SUBMIT_DIR
# execute the matlab program but without the graphical interface
matlab -nodisplay -nodesktop -r "run('script.m')"

The necessary instruction is the following: --gres=gpu:1.
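
The same option can be used for an interactive session, for example by adding it to the srun command shown earlier:

srun --pty -c 10 -N 1 --gres=gpu:1 /bin/bash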

SBATCH options

The main options are the following (an illustrative header combining several of them is shown after the list):

  • #SBATCH --partition=<part>: chooses the partition to use for the job.
  • #SBATCH --job-name=<name>: defines the name of the job as it will be displayed by the various SLURM commands (squeue, sstat, sacct).
  • #SBATCH --output=<stdOutFile>: the standard output (stdout) is redirected to the file defined by --output or, if not defined, to a default file slurm-%j.out (SLURM replaces %j with the job ID).
  • #SBATCH --error=<stdErrFile>: the error output (stderr) is redirected to the file defined by --error or, if not defined, to the standard output.
  • #SBATCH --input=<stdInFile>: the standard input can also be redirected with --input. By default /dev/null is used (none/empty).
  • #SBATCH --open-mode=<append,truncate>: defines how the output files are opened for writing, like the open/fopen of most programming languages: append writes after the existing file (if it exists), truncate (the default) overwrites the file at each batch execution.
  • #SBATCH --mail-user=<e-mail>: defines the e-mail address of the recipient.
  • #SBATCH --mail-type=<BEGIN,END,FAIL,TIME_LIMIT,TIME_LIMIT_50,...>: sends an e-mail notification for particular events in the life of the job: beginning of execution (BEGIN), end of execution (END, FAIL, TIME_LIMIT), etc. See the SLURM documentation for the complete list of supported events.
  • #SBATCH --cpus-per-task=<n>: defines the number of CPUs to allocate per task. The actual use of these CPUs is up to each task (creation of processes and/or threads).
  • #SBATCH --ntasks=<n>: defines the maximum number of tasks executed in parallel.
  • #SBATCH --mem-per-cpu=<n>: defines the RAM in MB allocated to each CPU. By default, 4096 MB are allocated to each CPU; this option lets you request a specific amount, at most 7800 MB (the maximum allocatable per CPU).
  • #SBATCH --nodes=<minnodes[-maxnodes]>: minimum [-maximum] number of nodes over which to distribute the tasks.
  • #SBATCH --ntasks-per-node=<n>: used in conjunction with --nodes, this option is an alternative to --ntasks that controls how tasks are distributed over the nodes.
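
As an illustration, a script header combining several of these options might look like this (the values are only examples; debug is the partition name that appears in the smap and pestat output below):

#!/bin/sh
#SBATCH --partition=debug
#SBATCH --job-name="MY_JOB"
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=20
#SBATCH --mem-per-cpu=4096
#SBATCH --output=%x.%J.out
#SBATCH --error=%x.%J.out
#SBATCH --mail-user=<user>@ceremade.dauphine.fr
#SBATCH --mail-type=BEGIN,END,FAIL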

Note

You can specify exactly which nodes you want to use, and how many CPUs on each, with the --nodelist option, for example:

#SBATCH --nodelist=clust8
#SBATCH -c 5

This explicitly selects the clust8 node, with 5 CPUs on that node. Of course, this is not recommended: SLURM handles the job distribution itself.

SLURM environment variables

The main variables are the following (a short usage sketch follows the list):

  • SLURM_JOB_ID: the identifier of the job (calculation), e.g. 12345.
  • SLURM_JOB_NAME: the name of the job defined with the -J / --job-name option, e.g. my_job.
  • SLURM_JOB_NODELIST: the list of nodes allocated by SLURM to the job.
  • SLURM_SUBMIT_HOST: name of the host on which sbatch was run (in our case, cluster).
  • SLURM_SUBMIT_DIR: directory from which the job was submitted, e.g. /home/user/chupin/scripts_pbs.
  • SLURM_JOB_NUM_NODES: number of nodes allocated to the job (e.g. requested with -N 5).
  • SLURM_NTASKS_PER_NODE: number of tasks per node requested for the job (set with --ntasks-per-node).
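
A short sketch of how these variables can be used inside an SBATCH script, for instance to log where the job ran (the echo lines are only illustrative):

# Print some job information into the output file
echo "Job $SLURM_JOB_ID ($SLURM_JOB_NAME) submitted from $SLURM_SUBMIT_HOST"
echo "Allocated nodes: $SLURM_JOB_NODELIST"
# move to the submission directory before running the program
cd $SLURM_SUBMIT_DIR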

Viewing jobs

To list the computations launched on the cluster, we use the smap program:

    chupin@cluster:~/pi/> smap -i 1

which produces something like:

┌─────────────────────────────────────────────────────────────────────────────────────┐
│..B.......                                                                           │
│                                                                                     │
│                                                                                     │
│                                                                                     │
│                                                                                     │
│                                                                                     │
│                                                                                     │
│                                                                                     │
└─────────────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────────────┐
│Tue Mar 16 16:57:10 2021                                                             │
│ID JOBID              PARTITION USER     NAME      ST      TIME NODES NODELIST       │
│A  101                debug     duleu    MON_JOB   R   00:00:20     1 clust3         │
│B  102                debug     duleu    MON_JOB   R   00:00:16     1 clust3         │
│C  103                debug     duleu    MON_JOB   PD  00:00:00     1 waiting...     │
│D  104                debug     duleu    MON_JOB   PD  00:00:00     1 waiting...     │
│E  105                debug     duleu    MON_JOB   PD  00:00:00     1 waiting...     │
│F  106                debug     duleu    MON_JOB   PD  00:00:00     1 waiting...     │
│G  107                debug     duleu    MON_JOB   PD  00:00:00     1 waiting...     │
│H  108                debug     duleu    MON_JOB   PD  00:00:00     1 waiting...     │
│I  109                debug     duleu    MON_JOB   PD  00:00:00     1 waiting...     │
│J  110                debug     duleu    MON_JOB   PD  00:00:00     1 waiting...     │
│                                                                                     │
└─────────────────────────────────────────────────────────────────────────────────────┘
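
The squeue command (mentioned in the options table above) can also be used; for example, to list only your own jobs:

    chupin@cluster:~/pi/> squeue -u chupin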

To see the occupancy rate of the nodes, we use the pestat command:

    chupin@cluster:~/pi/> pestat

Here is an example of what is displayed:

Hostname Partition Node Num_CPU CPUload Memsize Freemem Joblist
                            State Use/Tot (MB) (MB) JobId User ...
  clust1 debug* down* 0 40 0.00* 1 0   
  clust2 debug* down* 0 40 0.00* 1 0   
  clust3 debug* idle 0 40 0.20 1 87878   
  clust4 debug* down* 0 40 0.00* 1 0   
  clust5 debug* down* 0 40 0.00* 1 0   
  clust6 debug* down* 0 40 0.00* 1 0   
  clust7 debug* down* 0 40 0.00* 1 0   
  clust8 debug* down* 0 40 0.00* 1 0   
  clust9 erc down* 0 40 0.00* 1 0   
 clust10 erc down* 0 40 0.00* 1 0   

Other programs

Other programs are available to handle the calculations submitted with sbatch. In particular:

  • sbatch, which submits a calculation.
  • sstat, which examines the status of a running job. The ID given in the JOBID column of smap is required.
        chupin@cluster:~/pi/> sstat 150
  • scancel, which cancels a job. The ID given in the JOBID column of smap is required.
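        For example, with the same job ID as in the sstat call above:
        chupin@cluster:~/pi/> scancel 150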