Running sinfo shows that 3 nodes are in the drain state:

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
all*         up   infinite      3  drain node[10,11,12]

Which command should I use to undrain these nodes?
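
For what it's worth, sinfo's list-reasons flag should also show why each node was drained (a small sketch):

sinfo -R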

– Zephyr

5 Answers

Found an approach: enter the scontrol interpreter (type scontrol at the command line) and then run

scontrol: update NodeName=node10 State=DOWN Reason="undraining"
scontrol: update NodeName=node10 State=RESUME

Then

scontrol: show node node10

displays amongst other info

State=IDLE
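
For reference, the same two updates can also be issued non-interactively from the shell, which is handier in scripts (a sketch reusing node10 from the question):

scontrol update NodeName=node10 State=DOWN Reason="undraining"
scontrol update NodeName=node10 State=RESUME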

Update: some of these nodes later went back to the DRAIN state. Running show node a10 showed Reason=SlurmdSpoolDir is full, i.e. the root partition had filled up. On Ubuntu, sudo apt-get clean emptied /var/cache/apt, and gzipping some files under /var/log freed the rest of the space.
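
To spot this condition, the spool directory location can be read from the Slurm configuration and its filesystem checked with df (a sketch; the actual path varies per installation):

# Where does SlurmdSpoolDir point on this node?
scontrol show config | grep SlurmdSpoolDir
# How full is the filesystem holding it? (/var/spool/slurmd is a common default)
df -h /var/spool/slurmd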

– elm

If no jobs are currently running on the node:

scontrol update nodename=node10 state=idle

If jobs are running on the node:

scontrol update nodename=node10 state=resume
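
To check which case applies, squeue can list the jobs on a given node (a sketch; node10 stands in for your node):

squeue --nodelist=node10
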
– irritable_phd_syndrome

If you set the node to DOWN, all jobs running on it will be killed.

Set the node to RESUME instead.
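
That is (a one-line sketch, reusing node10 from the question):

scontrol update NodeName=node10 State=RESUME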

– LiPi

Another reason a node can be in the DRAIN state is that the facts about the system do not match those declared in the /etc/slurm/slurm.conf file. For example, if slurm.conf declares that a node has 4 GPUs but the slurm daemon only finds 3 of them, it will mark the node as "drain" because of the mismatch. Likewise, if the node is declared in slurm.conf to have 128G of memory and the slurm daemon only finds 96G, it will also set the state to "drain".

The reason code for such mismatches is displayed by the scontrol show node <nodename> command as the last line of output.
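
To compare what the daemon actually detects with what slurm.conf declares, slurmd can print the hardware it sees (a sketch; run on the drained node itself):

# Print the detected hardware as a slurm.conf-style node line
slurmd -C
# Compare with the configured and reported values
scontrol show node node10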

  • This turned out to be the case for me. I recently disabled SMT on my AMD processors and found all my nodes in the `drain` state because Slurm was expecting 2 threads per core (as that was what was in the node spec). – Sean W Feb 10 '22 at 02:02

While there is already an accepted answer, I would like to mention that going through:

scontrol: update NodeName=nodename State=DOWN Reason="undraining"
scontrol: update NodeName=nodename State=RESUME

returns slurm_update error: Invalid node state specified on Slurm 21.08.03 (EndeavourOS 2021.08.27). The solution that worked for me is:

scontrol: update NodeName=nodename State=UNDRAIN

There is no need to set the node to DOWN first.
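
As a plain shell one-liner, the same fix would be (a sketch; substitute your actual node name for nodename):

scontrol update NodeName=nodename State=UNDRAIN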

– Araneus0390