Determining resources needed for your Slurm job
If you are unsure of how many resources to allocate for your job, there are a few ways to figure out how best to tailor your sbatch script to run more efficiently. One way is to run your job with bare-bones flags, such as only specifying the partition you want to run on and setting a constraint to use only CentOS 7 or CentOS 6 nodes.
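For example, a bare-bones sbatch script might contain little more than the following. This is a sketch: the partition and constraint names are taken from the examples later on this page, and the final line is a placeholder for your own command.
#!/bin/bash
#SBATCH --partition=sched_mit_hill
#SBATCH --constraint=centos7
# replace the line below with the command your job actually runs
./my_program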
Once your job has run, you can view information about its efficiency, memory usage, and CPU utilization with the command:
seff [jobid]
This will give you output such as:
Job ID: 123123123
Cluster: eofe7
User/Group: user/user
State: COMPLETED (exit code 0)
Nodes: 16
Cores per node: 8
CPU Utilized: 01:26:44
CPU Efficiency: 0.76% of 7-21:47:44 core-walltime
Job Wall-clock time: 01:28:58
Memory Utilized: 51.45 GB (estimated maximum)
Memory Efficiency: 0.0531% of 960.00 GB (60.00 GB/node)
In this example, you can see that the job used only a small fraction of both the CPU time and the memory it requested.
If you are still unsure whether you are requesting enough resources for your job, read more below about how to properly use each of Slurm's sbatch flags.
The number of nodes your job requires depends on many factors, including the overall size of the job you are running, how many CPUs your job needs, how much memory your job needs, and whether your job requires a GPU.
The number of nodes also depends on whether you are submitting to a public partition, one that is open to all users on the cluster, or to a smaller, private partition that belongs to a specific group or lab.
To determine how many nodes you should request for your job, please read the factors below to help determine what would be the most efficient for your workflow.
If you are unsure of how many nodes you should request, but you have an idea of how many CPUs and how much memory your job needs, you can omit the -N or --nodes flag entirely and Slurm will spread your job across as many nodes as it sees fit according to its scheduling algorithms.
You can also request a range of nodes by editing the -N flag to be something like -N 8-16, which tells Slurm to request at least 8 nodes but no more than 16; it will allocate between 8 and 16 nodes depending on what resources are free.
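As it would appear in an sbatch script (a sketch):
# ask for a minimum of 8 nodes and a maximum of 16
#SBATCH -N 8-16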
The number of CPUs, or 'tasks' as Slurm identifies them with the -n or --ntasks flag, determines how many CPU cores your job will use to run. Most nodes on the Engaging cluster, including those in public partitions such as engaging_default, have between 16 and 20 CPUs.
You can view the number of CPUs on a specific node by running:
scontrol show node node[number]
and looking at the CPUTot value.
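For example, to check a single node (the node name here is a placeholder; use a real node name from your partition):
# prints the line containing the CPUTot value for that node
scontrol show node node123 | grep CPUTot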
The number of CPUs your job needs will depend on how process-intensive your job is. For example, jobs that run complex calculations and spawn multiple processes within the initial job should run on more CPUs than, say, a simple Python program that parses files.
When requesting CPUs, if you specify a number of nodes that together have fewer CPUs than the number you request, your job will not run due to insufficient resources.
If you are unsure of how many nodes to request, but know your job needs 32 CPUs, you can omit the -N or --nodes flag and Slurm will spread your job across as many nodes as necessary to provide your job with 32 CPUs.
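A sketch of such a request, showing only the resource-related lines (the partition name is the public one used elsewhere on this page):
# no -N / --nodes line, so Slurm decides how many nodes to spread the tasks over
#SBATCH --partition=sched_mit_hill
#SBATCH -n 32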
The amount of memory per node will vary based on the complexity of your job and how much data it generates while running. If you are running a job that outputs lots of small data files, you will want a fair amount of memory so that your job can write these files efficiently as it runs.
Please note that this flag is for *memory per node*: if you would like your job to have a total of 100GB of memory and you are requesting 2 nodes, this flag would need to say --mem=50GB, since 100GB divided across 2 nodes is 50GB per node. If you set this flag to the total amount of memory you want for your job, you will end up requesting far more memory than you intended, since it will be multiplied by the number of nodes you request.
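For the 100GB example above, the relevant lines would look like this sketch:
# 2 nodes x 50GB per node = 100GB total for the job
#SBATCH -N 2
#SBATCH --mem=50GB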
Memory per CPU is different from memory per node. If you request 4 CPUs on 1 node but ask for 100GB of memory per CPU, that node will have to provide 400GB of memory for your job to run, whereas if you only need 100GB of memory in total, using the --mem flag will allocate just that 100GB, since it is memory per node, not memory per CPU.
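The two requests above, written out side by side as a sketch (memory per CPU is requested with Slurm's --mem-per-cpu flag):
# memory per CPU: 4 CPUs x 100GB each = 400GB on the node
#SBATCH -n 4
#SBATCH --mem-per-cpu=100GB

# memory per node: 100GB total on the node, shared by the 4 CPUs
#SBATCH -n 4
#SBATCH --mem=100GB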
Unless your job specifically uses a GPU, requesting one is usually not necessary. For jobs that do need a GPU, the number of GPUs you can request will depend on how many GPUs are available.
For example, on the sched_mit_hill partition, which is the default public partition, some nodes have 1x Tesla K20m GPU. Unless your job specifically needs more than 1 GPU, you can request one of these GPU nodes by adding the --gres=gpu:1 flag to your sbatch job script. If you do need more than 1 GPU and are submitting to this partition, you will want to specify that you also need more than 1 node, since requesting more than 1 GPU on a single node will most likely stay pending for lack of available resources.
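A sketch of such a request, showing only the GPU-related lines:
# request 1 GPU on the node
#SBATCH --partition=sched_mit_hill
#SBATCH --gres=gpu:1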
Example Scenario:
Consider the following scenario:
A user wants to run a fairly intensive job that uses ~9GB of data files while it is running, meaning these files need to be held open in RAM (memory) for the job to read from and use. The user starts out with the following sbatch script:
#SBATCH -N 6
#SBATCH --constraint=centos7
#SBATCH --partition=sched_mit_hill
#SBATCH --error=error.txt
#SBATCH -n 100
#SBATCH --mem=50GB
However, when they submit their job, it sits pending for a long time, and when it does run, it eventually fails because not enough memory is available for the job.
They would like to know how they can:
reduce the amount of time their job waits for resources
request enough memory for the job so it does not fail.
According to their sbatch script, they are requesting 100 CPUs across 6 nodes.
100 CPUs across 6 nodes works out to 16-17 CPUs per node. This is somewhat reasonable, but it is probably why their job takes so long to start. The nodes on the sched_mit_hill partition have either 16 or 20 CPUs total, so waiting for 6 nodes to free up completely so the job can use all 16 CPUs on each is what is delaying the job.
We suggest first picking a CPU total that is a power of 2 (2, 4, 8, 16, 32, etc.), in this case 128, and then distributing the CPUs across more nodes, we would suggest 16, so that the job requests 8 CPUs per node, or half the total CPUs of each node. This request is much more likely to be satisfied sooner than 100 CPUs across 6 nodes. They can also request a range of nodes by editing the -N flag to be something like -N 8-16, which tells Slurm to request at least 8 nodes but no more than 16, allocating between 8 and 16 nodes depending on what resources are free.
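Putting these suggestions together, a revised version of the script above might look like this sketch (only the node and CPU lines change; remember from the memory section that --mem is per node):
#SBATCH -N 8-16
#SBATCH --constraint=centos7
#SBATCH --partition=sched_mit_hill
#SBATCH --error=error.txt
#SBATCH -n 128
#SBATCH --mem=50GB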
If you have any questions about using Slurm, please email orcd-help-engaging@mit.edu