I am having a strange problem (this is my first exercise using Python).

I have a Python script called run_class. I want to store its output (both stdout and stderr) in run-class.out.

So, after looking at some examples on the web, I do the following:

nohup ./run_class > run-class.out &

I get:

[1] 13553 ~$ nohup: ignoring input and redirecting stderr to stdout

So all seems well at first. Indeed, the program runs fine until I log out from the remote machine, and then it crashes. Logging out is exactly what causes the crash: if I stay logged in, the program runs to completion.

The run-class.out has the following error:

Traceback (most recent call last):
  File "./run_class", line 84, in <module>
    wait_til_free(checkseconds)
  File "./run_class", line 53, in wait_til_free
    while busy():
  File "./run_class", line 40, in busy
    kmns_procs = subprocess.check_output(['ps', '-a', '-ocomm=']).splitlines()
  File "/usr/lib64/python2.7/subprocess.py", line 573, in check_output
    raise CalledProcessError(retcode, cmd, output=output)
subprocess.CalledProcessError: Command '['ps', '-a', '-ocomm=']' returned non-zero exit status 1

What is wrong with my nohup?

Many thanks!

Note that the command works as long as I do not log out, so I don't quite understand the problem.

Btw: here is the program:

#!/usr/bin/python

import os
import os.path
import sys

ncpus = 8
datadir = "data" # double quotes preferred to allow for apostrophe's
ndatasets = 100
checkseconds = 1
basetries = 100

gs = [0.001, 0.005, 0.01, 0.05, 0.1]
trueks = [4, 7, 10]
ps = [4, 10, 100]
ns = [10, 100]  # times k left 1000 out, would be too much
shapes = ["HomSp"]
methods = ["Ma67"]


def busy(): 
    import subprocess
    output = subprocess.check_output("uptime", shell=False)
    words = output.split()
    sys.stderr.write("%s\n"%(output)) 
    try:
        kmns_procs = subprocess.check_output(['ps', '-a', '-ocomm=']).splitlines()
    except subprocess.CalledProcessError as x:
        print('ps returned {}, time to quit'.format(x))
        return
    kmns_wrds = 0
    procs = ["run_kmeans", "AdjRand", "BHI", "Diag", "ProAgree", "VarInf", "R"]
    for i in procs:
        kmns_wrds += kmns_procs.count(i)

    wrds=words[9]
    ldavg=float(wrds.strip(','))+0.8
    sys.stderr.write("%s %s\n"%(ldavg,kmns_wrds))
    return max(ldavg, kmns_wrds) >= ncpus


def wait_til_free(myseconds):
    while busy():
        import time
        import sys
        time.sleep(myseconds)

if True:
    for method in methods:
        for shape in shapes:
            for truek in trueks:
                for p in ps:
                    for n in ns:
                        actualn = n*truek
                        for g in gs:
                            fnmprfix = "%sK%sp%sn%sg%s"%(shape,truek,p,n,g)
                            fname = "%sx.dat"%(fnmprfix)
                            for k in range(2*truek+2)[2:(2*truek+2)]:
                                ofprfix = "%sk%s"%(fnmprfix,k)
                                ntries =  actualn*p*k*basetries
                                ofname = "%s/estk/class/%s.dat"%(datadir,ofprfix,)
                                if os.path.isfile(ofname):
                                    continue
                                else :
                                    wait_til_free(checkseconds)
                                    mycmd = "nice ../kmeans/run_kmeans -# %s -N %s -n %s -p %s -K %s -D %s -X %s -i estk/class/%s.dat -t estk/time/%s_time.dat -z estk/time/%s_itime.dat -w estk/wss/%s_wss.dat  -e estk/error/%s_error.dat -c estk/mu/%s_Mu.dat -m %s &"%(ndatasets,ntries,actualn,p,k,datadir,fname,ofprfix,ofprfix,ofprfix,ofprfix,ofprfix,ofprfix,method)
                                    sys.stderr.write("%s\n"%(mycmd))
                                    from subprocess import call
                                    call(mycmd, shell=True)
user3236841
  • Do you see any error at the end of run-class.out? – Raniz May 11 '15 at 01:12
  • Sorry there is an error. Posted above. – user3236841 May 11 '15 at 01:13
  • possible duplicate of [python check\_output fails with exit status 1 but Popen works for same command](http://stackoverflow.com/questions/28675138/python-check-output-fails-with-exit-status-1-but-popen-works-for-same-command) – James Mills May 11 '15 at 01:26
  • Use ``Popen`` directly and understand your exit status(es). – James Mills May 11 '15 at 01:26
  • So I should replace the subprocess with Popen? – user3236841 May 11 '15 at 01:30
  • I am using subprocess for system calls: specifically, for running some executables and also for making sure that the number of executables running at a given time does not exceed the total number of CPUs available. Again, the program works fine without logging out, so the problem is an interaction with nohup. – user3236841 May 11 '15 at 01:32
  • FWIW I cannot reproduce your problem with a simple test case. – James Mills May 11 '15 at 01:34
  • btw, using -u option as in one of the stackoverflow answers did not help. – user3236841 May 11 '15 at 01:35
  • @user3236841: `Popen` _is_ part of `subprocess`. The `check_output` function is just a very simple wrapper around creating a `Popen`, calling `communicate` on it, and checking the status. – abarnert May 11 '15 at 01:54

1 Answer

The ps command is returning an error (a nonzero exit status), possibly just because it was interrupted by a signal when you tried to log out, and possibly even by the very SIGHUP you didn't want. (Note that bash explicitly sends SIGHUP to every job in its job control table if it gets SIGHUP'd itself, and if the huponexit option is set, it does so for any exit reason.)

You're using check_output. The check part of the name means "check the exit status, and if it's nonzero, raise an exception". So, of course it raises an exception.
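To see that behaviour in isolation, here is a minimal sketch. It assumes nothing about ps: the failing command is just a Python child process that exits with status 1, and `exit_status_of` is a hypothetical helper name used only for this illustration.

```python
import subprocess
import sys

# A child process that exits with status 1, standing in for the failing ps call.
failing_cmd = [sys.executable, "-c", "import sys; sys.exit(1)"]


def exit_status_of(cmd):
    """Return 0 if cmd succeeds, else the nonzero status check_output reported."""
    try:
        subprocess.check_output(cmd)
    except subprocess.CalledProcessError as err:
        # check_output raises as soon as the child's exit status is nonzero;
        # the status itself is available on the exception.
        return err.returncode
    return 0


status = exit_status_of(failing_cmd)
sys.stderr.write("exit status: %s\n" % status)
```

The exception carries the exit status in `returncode`, which is the same number that appears at the end of the traceback in the question.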

If you want to handle the exception, you can use a try statement. For example:

try:
    kmns_procs = subprocess.check_output(['ps', '-a', '-ocomm=']).splitlines()
except subprocess.CalledProcessError as x:
    print('ps returned {}, time to quit'.format(x))
    return
do_stuff(kmns_procs)

But you can also just use a Popen directly. The high-level wrapper functions like check_output are really simple; basically, all they do is create a Popen, call communicate on it, and check the exit status. For example, here's the source to the 3.4 version of check_output. You can do the same thing manually (and without all the complexity of dealing with different edge cases that can't arise for your use, creating and raising exceptions that you don't actually want, etc.). For example:

ps = subprocess.Popen(['ps', '-a', '-ocomm='], stdout=subprocess.PIPE)
output, _ = ps.communicate()
if ps.poll():
    print('ps returned {}, time to quit'.format(ps.poll()))
    return
do_stuff(output)
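To make the "simple wrapper" description concrete, here is a simplified sketch of what check_output does. It is not the standard-library source (the real function handles more arguments and edge cases), and `simple_check_output` is a hypothetical name used only here.

```python
import subprocess
import sys


def simple_check_output(cmd):
    """Run cmd, return its stdout, and raise if the exit status is nonzero."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    output, _ = proc.communicate()  # reads stdout and waits for the child to exit
    if proc.returncode:  # any nonzero exit status counts as failure
        raise subprocess.CalledProcessError(proc.returncode, cmd, output=output)
    return output


# Usage: a child that prints one line and exits with status 0.
greeting = simple_check_output([sys.executable, "-c", "print('hi')"])
```

Dropping the `raise` and inspecting `proc.returncode` yourself gives the manual Popen variant shown above.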

Meanwhile, if you just want to know how to make sure you never get SIGHUP'd: don't just nohup the process, also disown it.

abarnert
  • OK, since this is inside a function (which actually returns True if the number of CPUs used is greater than the total number of CPUs), does it matter that there is a return in the exception handler or after ps.poll? – user3236841 May 11 '15 at 02:14
  • @user3236841: I just put the `return` there to prevent it from trying to run all the `do_stuff` code with output that either won't exist, or won't mean what you expect it to. How you handle this case in your real code is up to you. What do you want to happen when you fail to count the number of CPUs used? Probably treat it as if everything's OK and we aren't using too many CPUs, I'd guess? – abarnert May 11 '15 at 02:29
  • I don't want it to fail, I guess:-) Basically, I am running about 300 jobs, 8 at a time. I don't want it to give up when it fails (I exit) because that will choke my machine. – user3236841 May 11 '15 at 02:36
  • @user3236841: So it sounds like you want to skip the rest of the checking and just `return False`, meaning we're not overusing CPUs, right? – abarnert May 11 '15 at 02:37
  • I guess, but does that mean that adding the processes will stop only while exiting? – user3236841 May 11 '15 at 02:42
  • @user3236841: I don't know what you're asking. I think I'd have to understand your whole program to know how to answer that. – abarnert May 11 '15 at 02:50
  • OK, I have posted the entire python program – user3236841 May 11 '15 at 02:52
  • Btw, returning False does not work. It goes on spawning processes. Perhaps I should change it to True and see? Edit: that does not work either because once logged out, the function stops evaluating and evaluating the processes. – user3236841 May 11 '15 at 03:12
  • Can I call the function back? So, return busy() for the exception because then it can go back and evaluate itself and when it does so, it will find that the process is not exiting and so it will go ahead and evaluate the rest? Edit: this does not work either. – user3236841 May 11 '15 at 03:49
  • @user3236841: From a quick glance: it looks like if you return false, it stops checking; if you return true, it goes back through the loop and checks again. So if you want it to ignore a failed ps and keep checking, you'd want to `return True`, not `return False`. (You may want to `print` or `log` more information, so if something else goes wrong, you can distinguish the "ps failed one time because it got HUP'd" from "real problem that I have to debug" cases. And probably write it to stderr rather than stdout to keep it from interfering with your useful output, if there is any.) – abarnert May 11 '15 at 03:54
  • But I did return True and then the program stops submitting new jobs upon logout (there is no error now but it does not do what I want). Btw, I changed the print statement and I get the exit status of 1 and that is what keeps on going. – user3236841 May 11 '15 at 04:28
  • It seems to me that the exception should ultimately do the same thing as the non-exception. – user3236841 May 11 '15 at 04:48
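The retry idea discussed in these comments (treat a failed listing as "still busy", log it to stderr, and check again) can be sketched as below. This is an illustration under stated assumptions, not the program from the question: `count_processes`, `wait_until_free` and the `max_failures` cutoff are hypothetical names, and the cutoff exists only so that a listing command that fails on every call does not make the loop wait forever.

```python
import subprocess
import sys
import time


def count_processes(cmd, names):
    """Count output lines of cmd equal to one of names; None if cmd fails."""
    try:
        lines = subprocess.check_output(cmd).decode().splitlines()
    except subprocess.CalledProcessError as err:
        # Log to stderr so the failure is visible but separate from real output.
        sys.stderr.write("process listing failed: %s\n" % err)
        return None
    return sum(lines.count(name) for name in names)


def wait_until_free(cmd, names, limit, seconds=1, max_failures=5):
    """Block while at least `limit` matching processes are listed.

    Returns True once there is room for another job, or False if the
    listing failed max_failures times in a row (hypothetical cutoff).
    """
    failures = 0
    while True:
        running = count_processes(cmd, names)
        if running is None:
            failures += 1  # a failed listing counts as "busy", so retry
            if failures >= max_failures:
                return False
        else:
            failures = 0
            if running < limit:
                return True
        time.sleep(seconds)
```

A caller would pass the listing command from the question, for example `['ps', '-a', '-ocomm=']`, and decide for itself what a False result should mean.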