This post is a collection of notes, tips, and tricks from setting up a Slurm server on Ubuntu 20.04 (Focal).
1. First, we set up munge, which can be installed on Ubuntu via:

    sudo apt install munge
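As a quick sanity check (an extra step, not part of the original notes), we can verify that the munge daemon works by encoding a credential and immediately decoding it locally:

    # A working munge setup should report "STATUS: Success" in the unmunge output
    munge -n | unmunge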
2. Install slurmd and slurmctld:

    sudo apt install slurm-wlm

If the installation fails with the error "adduser: The UID 64030 is already in use", the following commands may be helpful:

    sudo apt install libuser
    sudo luseradd -r --shell=/bin/false -M --uid=64030 slurm
3. Write the following contents into /etc/slurm/slurm.conf:

    ControlMachine=iris2
    AuthType=auth/munge
    CryptoType=crypto/munge
    MpiDefault=none
    ProctrackType=proctrack/linuxproc
    ReturnToService=2
    SlurmctldPidFile=/run/slurm/slurmctld.pid
    SlurmctldPort=6817
    SlurmdPidFile=/run/slurm/slurmd.pid
    SlurmdPort=6818
    SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
    SlurmUser=slurm
    StateSaveLocation=/var/lib/slurm-llnl/slurmctld
    SwitchType=switch/none
    TaskPlugin=task/none
    InactiveLimit=0
    KillWait=30
    MinJobAge=300
    SlurmctldTimeout=120
    SlurmdTimeout=300
    Waittime=0
    SchedulerType=sched/backfill
    SelectType=select/cons_res
    SelectTypeParameters=CR_Core
    AccountingStorageType=accounting_storage/none
    AccountingStoreFlags=job_comment
    ClusterName=cluster
    JobCompType=jobcomp/none
    JobAcctGatherFrequency=30
    JobAcctGatherType=jobacct_gather/none
    SlurmctldDebug=3
    SlurmdDebug=3
    NodeName=iris2 CPUs=32 Boards=1 SocketsPerBoard=32 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=64294
    PartitionName=batch Nodes=iris2 Default=YES MaxTime=INFINITE State=UP

The host name iris2 (occurring in multiple locations in the configuration above) should be replaced by the actual host name of the server, which can be obtained by running hostname in the terminal. The second line from the bottom (the NodeName line) should also be replaced with information relevant to the server, which can be obtained by running slurmd -C in the terminal. The slurmd command becomes available after step 2.
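For reference, slurmd -C prints a ready-made node definition that can be pasted into slurm.conf. On the machine used here it would match the NodeName line already shown above; note that recent Slurm versions may also print an UpTime= line, which should not be copied into the configuration:

    # Example output of `slurmd -C` (the values simply mirror the configuration above)
    NodeName=iris2 CPUs=32 Boards=1 SocketsPerBoard=32 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=64294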
4. Start the slurmd and slurmctld services by running:

    sudo service slurmd start
    sudo service slurmctld start

respectively, and check their status via:

    sudo service slurmd status
    sudo service slurmctld status

Quite often there are issues with the PID files for slurmd, slurmctld, or both. With the configuration file above, we may need to create the file /etc/tmpfiles.d/slurmd.conf and put the following line into it:

    d /run/slurm 0770 root slurm -

which guarantees that the /run/slurm directory is created with the right permissions after each reboot. We also need to create the directories /var/lib/slurm-llnl/slurmd and /var/lib/slurm-llnl/slurmctld and give them the right owner (i.e., slurm), via:

    sudo mkdir -p /var/lib/slurm-llnl/slurmd
    sudo mkdir -p /var/lib/slurm-llnl/slurmctld
    sudo chown slurm:slurm /var/lib/slurm-llnl/slurmd
    sudo chown slurm:slurm /var/lib/slurm-llnl/slurmctld

If running sudo service slurmd start or sudo service slurmctld start fails because port 6818 and/or 6817 is already in use, we can kill whatever is running on the relevant port, e.g.:

    sudo kill -9 `sudo lsof -t -i:6818`
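To apply the tmpfiles rule immediately without waiting for a reboot (an extra step, not in the original notes), systemd can be asked to process the file right away:

    # Create /run/slurm now, using the rule in /etc/tmpfiles.d/slurmd.conf
    sudo systemd-tmpfiles --create /etc/tmpfiles.d/slurmd.conf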
Sometimes, when trying to submit jobs after the server is up and running, it may report that the compute node is drained. To undrain it, we may need to follow the instructions in Ref. [1]. Also, the server log sometimes complains about not being able to find the PID file; this does not matter much for running the server and managing jobs.
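Following Ref. [1], undraining usually amounts to resuming the node with scontrol. A minimal sketch, assuming the node is named iris2 as in the configuration above:

    # Check why the node was drained, then put it back into service
    sudo scontrol show node iris2 | grep -i reason
    sudo scontrol update NodeName=iris2 State=RESUME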
If the watch -t squeue command shows the following status,
Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions
we can check the node status by running sinfo, where we may see the state inval, indicating that something is wrong with the Slurm service. We can then run sudo systemctl status slurmctld to check the slurmctld status. It happened to me that after upgrading Ubuntu from 20.04 to 22.04, the original slurm.conf file no longer worked, causing the error shown above. In my case, I had to comment out the out-of-date FastSchedule=0 option. I also had to change the COMPUTE NODES definition lines from:
    NodeName=pc113118 RealMemory=95363 Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 State=UNKNOWN
    PartitionName=batch Nodes=pc113118 Default=YES MaxTime=INFINITE State=UP
to:
    NodeName=pc113118 CPUs=48 Boards=1 SocketsPerBoard=2 CoresPerSocket=12 ThreadsPerCore=2 RealMemory=95331
    PartitionName=batch Nodes=pc113118 Default=YES MaxTime=INFINITE State=UP
To obtain the information to be put in the node definition, we can use the slurmd -C command. I then needed to copy /etc/slurm/slurm.conf to /etc/slurm-llnl/slurm.conf, and finally run sudo systemctl restart slurmd followed by sudo systemctl restart slurmctld.
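After the restart, a quick way to confirm that the node has recovered (standard Slurm and systemd commands, not part of the original notes) is:

    # The node should now show a state such as "idle" rather than "inval" or "drain"
    sinfo -N -l
    # And the controller service should be active
    sudo systemctl status slurmctld --no-pager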
References
[1] https://stackoverflow.com/questions/29535118/how-to-undrain-slurm-nodes-in-drain-state