What does jug status 'Active' mean, and why does it not equal the number of procs requested?

Question

I've been unable to find out what status 'Active' means. I'm using jug 2.1.1, and the word does not appear anywhere in the manual except in a footnote about 'active-wait'.

I'm using an LSF array to run a large number (hundreds of thousands) of minutes-long, single-core jobs. Peculiarly, although jobs do move from 'Ready' to 'Complete', and none are listed as 'Failed' or 'Waiting', the output of jug status has no 'Running' column (which I've seen in the worked examples) and instead has a column called 'Active'. The number of active tasks varies, but stays between 800 and 950 for an LSF array with 2000 elements. According to LSF (the output of bjobs -r), every element of the job array has status 'RUN'. Although I have not checked exhaustively, manually ssh-ing into a node where some of my jobs landed and running 'htop' shows the expected number of processes, each pinning an available core. However, since that was only a spot-check, it is conceivable that some processes in my job array are not doing this.

Does Running == Active in the output of jug status? Am I failing to use about 1,100 processors that I am nonetheless occupying with nominally single-threaded jobs?

Thanks for the input. Happy to provide more details as needed.

For more open-ended discussion of how to use jug, consider the mailing list https://groups.google.com/g/jug-users – luispedro, May 02 '22 at 14:09

score 1 · Accepted Answer · answered May 02 '22 at 14:08


(author of jug here): It does mean "jobs running right now".

If you are using the file backend and running thousands of jobs simultaneously, the count may simply be out of sync. While jug status is working, some jobs may be running without being seen as running: between the moment it lists the locks and the moment it goes through the list of jobs, they have finished and others have started. Also, the listing of locks can be out of date on a network filesystem. (This does not matter for actually creating locks, but doing that reliably is much slower, and we do not want to pay that cost for jug status.)

This should be much less serious with the redis backend, btw.
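To illustrate how a file-backend status pass can undercount, here is a toy model in Python. It is not jug's code: `Task`, `running_at` and `two_phase_count` are hypothetical names, and the model assumes only what the answer describes, namely that the lock listing is taken first and the tasks are checked afterwards.

```python
from dataclasses import dataclass


@dataclass
class Task:
    start: float  # time the task's lock appears (task starts running)
    end: float    # time the task finishes and its lock goes away


def running_at(tasks, t):
    """Ground truth: tasks whose lock exists at time t."""
    return [task for task in tasks if task.start <= t < task.end]


def two_phase_count(tasks, t_list, t_check):
    """Toy status pass: snapshot the lock listing at t_list, then at
    t_check count only listed tasks that have not finished yet.
    Tasks that started after t_list are never seen, and listed tasks
    that finished before t_check are no longer counted as active."""
    listed = running_at(tasks, t_list)
    return sum(1 for task in listed if task.end > t_check)


tasks = [Task(0, 5), Task(1, 6), Task(3, 8), Task(4.5, 9)]
print(len(running_at(tasks, 4)))       # 3 running when locks are listed
print(len(running_at(tasks, 5.5)))     # 3 running when tasks are checked
print(two_phase_count(tasks, 4, 5.5))  # 2 reported as active
```

In this model the shortfall comes only from tasks that start or finish during the status pass, so it shrinks as tasks get longer relative to the time the pass takes. With half-hour tasks it should be small, which matches the doubt raised in the comments.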


luispedro


I am using an NFS-type setup, but the individual jug tasks likely take many minutes (maybe half an hour), not seconds, so I'd be surprised if it were _that_ slow to sync. I don't know anything about the technicalities of using the redis backend, but if it's easy to switch, I'd be happy to do that. Would you recommend it? – LGS May 02 '22 at 14:40
Oh, in that case, it does seem strange. Some NFS setups are quite slow, but still... – luispedro May 02 '22 at 17:42
There isn't a lot of magic at this level: you can count the number of files in the `jugfile.jugdata/locks` directory to see how many tasks are running – luispedro May 02 '22 at 17:48
OK. Those jobs finished, but if this continues to happen I will do that and report back. – LGS May 02 '22 at 19:07
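Following luispedro's suggestion to count the files in `jugfile.jugdata/locks`, here is a minimal Python sketch. It assumes the file backend with the default jugdir name (`jugfile.jugdata` for `jugfile.py`) and one lock file per running task; `count_locks` and `sample_locks` are hypothetical helpers, not part of jug. Sampling several times helps distinguish a momentary race from a persistent gap.

```python
import time
from pathlib import Path


def count_locks(jugdir="jugfile.jugdata"):
    """Count regular files in <jugdir>/locks (assumes one file per running task)."""
    locks_dir = Path(jugdir) / "locks"
    if not locks_dir.is_dir():
        return 0  # no locks directory: nothing is locked
    return sum(1 for entry in locks_dir.iterdir() if entry.is_file())


def sample_locks(jugdir="jugfile.jugdata", n=5, interval=10.0):
    """Take n lock counts, sleeping `interval` seconds between them."""
    counts = []
    for i in range(n):
        counts.append(count_locks(jugdir))
        if i < n - 1:
            time.sleep(interval)
    return counts
```

For example, run `print(sample_locks())` from the directory containing `jugfile.py`. If the counts stay close to 2000 while jug status reports 800 to 950 active, the status output is more likely lagging; if they also stay near 900, the gap is more likely real. On NFS the directory listing itself may lag, as the answer notes.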
