
The way I start slurm:

mkdir -p /tmp/slurmstate/clustername
sudo slurmd
sudo munged -f
/etc/init.d/munge start 
sudo slurmdbd
sudo slurmctld -c

-

sacctmgr list cluster
   Cluster     ControlHost  ControlPort   RPC     Share GrpJobs       GrpTRES GrpSubmit MaxJobs       MaxTRES MaxSubmit     MaxWall                  QOS   Def QOS
---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- ---------
   cluster                            0  7936         1                                                                                           normal

Running `slurmctld -cD` gives me the following error. The cluster name it reports is an invalid string that I don't recognize. How can I fix it?

> slurmctld -cD
slurmctld: fatal: CLUSTER NAME MISMATCH.
slurmctld has been started with "ClusterName=�����", but read "cluster" from the state files in StateSaveLocation.
Running multiple clusters from a shared StateSaveLocation WILL CAUSE CORRUPTION.
Remove /tmp/slurmstate/clustername to override this safety check if this is intentional (e.g., the ClusterName has changed).

Note: This problem started occurring after I ran Slurm as the root user and then switched back. I had to reinstall MySQL to fix it.

Thank you for your valuable time and help.

alper
  • You could check the value of `ClusterName` in `slurm.conf` and make sure the encoding of that file is correct. – damienfrancois Jun 12 '17 at 07:54
  • Actually, the `slurm.conf` file does contain `ClusterName=cluster`. @damienfrancois – alper Jun 12 '17 at 11:42
  • 1) Why do you create `/tmp/slurmstate/clustername` as a directory (from your latest edit)? --- 2) I think the error message is incorrect: if I read [the source code correctly](https://github.com/SchedMD/slurm/blob/slurm-17-02-4-1/src/slurmctld/controller.c#L2634-L2643), the non-printable characters were found in the state file, not in `slurm.conf` (the error is real, but the message unfortunately switches the values) – Hugues M. Jun 18 '17 at 13:45
  • Oh... then let's undelete my answer that I thought was just noise :) – Hugues M. Jun 18 '17 at 16:34

1 Answer


I'm a complete SLURM noob (just started taking interest in it for work), so apologies if I make misguided suggestions, but I think I can point at something wrong.

First line in your startup sequence:

mkdir -p /tmp/slurmstate/clustername

This creates `clustername` as a directory, not as a file.

When the daemon starts, it tries to read this path as a file (using `fopen` and `fgets`, see source code of latest version).

And then, since the behavior of fopen-ing a directory is system-dependent, anything can happen (it could read garbage, or fail...). It would be interesting if you could specify what OS you are using.
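To see what is actually sitting at that path on your system, a small shell sketch like the one below may help. It reports whether the path is a directory, a regular file, or missing, and for a file it dumps the bytes so that non-printable characters stand out. The function name `inspect_state_path` is made up for illustration and is not part of Slurm.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): report what kind of
# object lives at a state path such as /tmp/slurmstate/clustername.
inspect_state_path() {
    path="$1"
    if [ -d "$path" ]; then
        echo "directory"
    elif [ -f "$path" ]; then
        echo "file"
        # od -c prints every byte, so garbage characters become visible
        od -c "$path"
    else
        echo "missing"
    fi
}
```

For example, `inspect_state_path /tmp/slurmstate/clustername` should print `directory` right after the `mkdir -p` line from the question has run.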

Suggestion:

  • rmdir /tmp/slurmstate/clustername

  • replace your first line with `mkdir -p /tmp/slurmstate`, to create the `slurmstate` directory if it does not exist, but do not create a `clustername` file (or directory!) yourself in there.

On the first start, the daemon will create the `clustername` file and write into it the name taken from your `slurm.conf` file. Subsequent startups will read the value from the file, compare it with the one in `slurm.conf`, and move on with the startup.
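The two suggestions could be combined into a small preparation step, roughly as sketched here. `prepare_state_dir` is a hypothetical helper name, and the path used in the usage note assumes `/tmp/slurmstate` is the StateSaveLocation, as the paths in the question suggest.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): make sure the state
# directory exists and that "clustername" inside it is not a
# directory left over from an earlier "mkdir -p".
prepare_state_dir() {
    state_dir="$1"
    # rmdir only deletes empty directories, so a non-empty one makes
    # the helper fail instead of silently losing data.
    if [ -d "$state_dir/clustername" ]; then
        rmdir "$state_dir/clustername" || return 1
    fi
    # Create only the parent directory; the clustername file itself
    # is left for the daemon to write, as described above.
    mkdir -p "$state_dir"
}
```

Calling `prepare_state_dir /tmp/slurmstate` before starting the daemons would then replace the first line of the startup sequence. An existing regular `clustername` file is left untouched.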

PS: I just noticed that you added that line in your last edit, so you had the root problem before doing that... so the problem I'm pointing at is probably not the cause. Maybe I should remove this answer (again) (but maybe your question will need another edit).

Hugues M.