This post collects notes, tips, and tricks for setting up a Slurm server on Ubuntu 20.04 (Focal).
I followed the instructions in Ref. [1] to set up a single-node Slurm service on Ubuntu. Several points are worth noting:
- Before following the setup instructions, both the slurm and munge users must exist on the server. If they do not exist yet, create them by following the instructions in Ref. [2].
- Steps 5-8 in Ref. [1] can be skipped. Instead, the Slurm configuration file quoted below can be used as a minimal working version:

    ControlMachine=orc-iris
    AuthType=auth/munge
    CryptoType=crypto/munge
    MpiDefault=none
    ProctrackType=proctrack/linuxproc
    ReturnToService=2
    SlurmctldPidFile=/run/slurm-llnl/slurmctld.pid
    SlurmctldPort=6817
    SlurmdPidFile=/run/slurm-llnl/slurmd.pid
    SlurmdPort=6818
    SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
    SlurmUser=slurm
    StateSaveLocation=/var/lib/slurm-llnl/slurmctld
    SwitchType=switch/none
    TaskPlugin=task/none
    InactiveLimit=0
    KillWait=30
    MinJobAge=300
    SlurmctldTimeout=120
    SlurmdTimeout=300
    Waittime=0
    SchedulerType=sched/backfill
    SelectType=select/cons_res
    SelectTypeParameters=CR_Core
    AccountingStorageType=accounting_storage/none
    AccountingStoreJobComment=YES
    ClusterName=cluster
    JobCompType=jobcomp/none
    JobAcctGatherFrequency=30
    JobAcctGatherType=jobacct_gather/none
    SlurmctldDebug=3
    SlurmdDebug=3
    NodeName=orc-iris CPUs=64 Boards=1 SocketsPerBoard=64 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=54306
    PartitionName=batch Nodes=orc-iris Default=YES MaxTime=INFINITE State=UP

- Save the configuration above as /etc/slurm-llnl/slurm.conf, rather than at the path given in step 9 of Ref. [1].
- The host name orc-iris appears in several places in the configuration above. Replace it with the actual host name of the server, which can be obtained by running hostname in the terminal.
- The second line from the bottom (the one starting with NodeName=) should also be replaced with values that match the server. They can be obtained by running slurmd -C in the terminal. The slurmd command becomes available after steps 1-4 in Ref. [1]. Also pay attention to the note in Ref. [2] about reducing the RealMemory value.
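
The checks and substitutions in the list above could be scripted roughly as follows. This is only a sketch and is not taken from Ref. [1] or Ref. [2]: the helper names missing_users and render_slurm_conf are hypothetical, and it assumes the quoted configuration has been saved as a template file (here called slurm.conf.template, also hypothetical) that still contains orc-iris as a placeholder. It also assumes the NodeName line contains no characters that are special to sed, which holds for typical slurmd -C output.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative only.

# Print each of the given user names that does NOT exist on this machine.
missing_users() {
    for u in "$@"; do
        id -u "$u" >/dev/null 2>&1 || echo "$u"
    done
}

# Read a slurm.conf template on stdin and write a personalised copy to stdout.
#   $1: host name that replaces the placeholder orc-iris
#   $2: full replacement for the NodeName=... line
render_slurm_conf() {
    host=$1
    node_line=$2
    # Swap the whole NodeName line first, then the remaining host names.
    sed -e "s|^NodeName=.*|$node_line|" -e "s|orc-iris|$host|g"
}

# Example usage (not run here; needs slurmd and sudo):
#   missing_users slurm munge
#   render_slurm_conf "$(hostname)" "$(slurmd -C | grep '^NodeName=')" \
#       < slurm.conf.template | sudo tee /etc/slurm-llnl/slurm.conf
```

If missing_users prints anything, those users still need to be created as described in Ref. [2]. The RealMemory value taken from slurmd -C may still need to be lowered by hand, per the note in Ref. [2].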
After the configuration above, the services are started with sudo service slurmd start
and sudo service slurmctld start,
but they sometimes fail to start. There can be several reasons; in my case, port 6818 (slurmd)
and/or 6817 (slurmctld)
was already in use. If so, the following command kills whatever is running on the relevant port, e.g.,
sudo kill -9 `sudo lsof -t -i:6818`
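
Before killing anything, it may be worth checking which ports are actually occupied. The following is a minimal sketch, assuming lsof is installed; the helper names port_pids and report_ports are hypothetical. It should be run as root to see processes owned by other users.

```shell
#!/bin/sh
# Hypothetical helpers around lsof; the names are illustrative only.

# Print the PIDs of processes using the given port, one per line.
# Prints nothing if the port is free.
port_pids() {
    lsof -t -i:"$1" 2>/dev/null
    return 0   # lsof exits non-zero when nothing matches; treat that as "free"
}

# Print "<port> busy" or "<port> free" for each port given.
report_ports() {
    for port in "$@"; do
        if [ -n "$(port_pids "$port")" ]; then
            echo "$port busy"
        else
            echo "$port free"
        fi
    done
}

# Example usage (not run here):
#   report_ports 6817 6818
```

Only a port reported as busy needs the kill command above, and port_pids shows which PIDs would be affected.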
Sometimes, when submitting jobs after the server is up and running, the compute node may be reported as drained. To undrain it, follow the instructions in Ref. [3]. The server log may also complain that it cannot find the PID file. This does not seem to matter much for running the server or managing jobs.
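
For quick reference, a drained node is commonly returned to service with scontrol update, and sinfo -R lists the reason each node was drained. The sketch below wraps that in a small helper. The function name undrain_node and the DRY_RUN variable are hypothetical, and the exact command should be checked against Ref. [3] for the Slurm version in use.

```shell
#!/bin/sh
# Hypothetical wrapper around scontrol; the name is illustrative only.

# Return a drained node to service.
#   $1: node name as written in slurm.conf (defaults to this machine's host name)
# Set DRY_RUN=1 to print the command instead of running it.
undrain_node() {
    node=${1:-$(hostname)}
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "sudo scontrol update NodeName=$node State=RESUME"
    else
        sudo scontrol update NodeName="$node" State=RESUME
    fi
}

# Example usage (not run here; needs a running slurmctld):
#   sinfo -R              # show why nodes are drained
#   undrain_node orc-iris
```

It is worth reading the reason shown by sinfo -R first, since a node that was drained because of a configuration mismatch may be drained again until the mismatch is fixed.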
References
[1] https://signac.io/development/2020/06/26/local-SLURM-environment.html
[2] https://docs.siliconcompiler.com/en/latest/tutorials/slurmsetup.html
[3] https://stackoverflow.com/questions/29535118/how-to-undrain-slurm-nodes-in-drain-state