When I run the following command, I can see a number of Slurm jobs. Since they are listed, I assume their records are saved somewhere.

$ sacct --format="JobID,JobName%30"                          
       JobID                        JobName
------------ ------------------------------
3            19kuX6ge4WzE2cyRtAUozP1SSE9HR+
3.batch                               batch
4            19kuX6ge4WzE2cyRtAUozP1SSE9HR+
4.batch                               batch
5            19kuX6ge4WzE2cyRtAUozP1SSE9HR+
5.batch                               batch
9.batch                               batch
2                                    run.sh
2.batch                               batch

$ sacct --jobs=4                                             
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
4            19kuX6ge4+      debug      alper          1  COMPLETED      0:0
4.batch           batch                 alper          1  COMPLETED      0:0

Afterwards, when I run scontrol show job <job_id>, it fails to return the job's full information.

$ scontrol show job 4                                       
slurm_load_jobs error: Invalid job id specified

What could be the reason for this? Is there an alternative way to fetch a job's information, such as its RunTime?

alper

1 Answer

scontrol only shows information about jobs that are currently running or that finished recently. How long a finished job counts as "recent" depends on the installation, but it is 5 minutes by default (I think). sacct, on the other hand, reads from the accounting database, so it works for all jobs.
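
As a rough sketch of the sacct route for the run time asked about in the question: the Elapsed field reports how long a job ran. The helper names `job_elapsed` and `job_report` below are made up for illustration, and the sketch assumes job accounting is enabled on the cluster (it evidently is here, since sacct already lists the jobs).

```shell
# Hypothetical helpers; sacct and the flags used are standard,
# but which fields are populated depends on the site's accounting setup.

# Print only the elapsed wall-clock time of one job.
job_elapsed() {
    # -n: no header, -X: allocation line only (skips the .batch step),
    # -P: unpadded, pipe-delimited output
    sacct -n -X -P --jobs="$1" --format=Elapsed
}

# Print a wider report for one job, including start and end times.
job_report() {
    sacct --jobs="$1" --format="JobID,JobName%30,State,Start,End,Elapsed,ExitCode"
}
```

For the job in the question this would be called as `job_elapsed 4` or `job_report 4`.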

ciaron
  • Can I set the "recently finished" time to, say, 365 days or infinite? Since `scontrol` seems to return much more information, I want to be able to use it all the time @ciaron – alper Jul 19 '20 at 19:03
  • Should I set the `PURGE_COMP` variable in the `slurm.conf` file? – alper Jul 19 '20 at 19:10
  • I don't know off the top of my head if there's a parameter in `slurm.conf` to control the timeout, but I suspect not. If I remember correctly, the `scontrol` data is held in memory on the `slurmctld` machine, and can get quite large (several GB over time) and needs to be purged to disk/database. You may find the information you need in `sacct` by setting the output format string for it. There's a lot more there than the default data. – ciaron Jul 19 '20 at 19:47
  • PS `PURGE_COMP` is only for reservations, IIRC. – ciaron Jul 19 '20 at 19:47
  • As I understand it, it's better to leave it at the 5-minute default for deleting those records – alper Jul 19 '20 at 22:12
  • The parameter is `MinJobAge` – damienfrancois Jul 20 '20 at 09:13
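
To see which `MinJobAge` value is currently in effect, something like the following might help. It is a sketch that assumes `scontrol` is on the PATH and that the running configuration lists the parameter under that name; `min_job_age` is a made-up helper name. The value itself is set in `slurm.conf` and is expressed in seconds.

```shell
# Hypothetical helper: print the MinJobAge line from the running
# slurmctld configuration as reported by scontrol.
min_job_age() {
    scontrol show config | grep -i '^MinJobAge'
}
```

Raising the value keeps finished jobs visible to `scontrol show job` for longer, at the cost of more job records held in `slurmctld` memory, which matches the concern raised in the comments above.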