
I need to run HyperOpt in a very parallelized manner (200+ workers). With fewer than 200 workers everything works fine, but as soon as more workers are started I can no longer get the output of the programs (which I need) from the objective function.

I first tried subprocess.Popen(), and it worked perfectly as long as there were no more than 200 workers. So I switched to os.system(program + " > " + stdout + " 2> " + stderr) and read the stdout and stderr log files afterwards, which again works fine unless there are more than 200 workers.

The stderr log file is written perfectly fine, but the stdout log usually stays empty or contains only a few random lines. The files do exist; they are just incomplete. The programs don't seem to exit, and there is no indication in the logs of anything going wrong, except for HyperOpt not finding results. Again, it works if the command is something like 'echo "hello world"' (very short output).

I am running this in a SLURM environment on an HPC cluster, but there are no SLURM errors either.

helper.py:

import os
from pathlib import Path

def run_program(program, logfiles):
    stdout = logfiles["stdout"]
    stderr = logfiles["stderr"]

    # Redirect the program's stdout/stderr into the two log files
    code = program + " > " + stdout + " 2> " + stderr

    retcode = os.system(code)
    # os.system returns the wait status; the actual exit code is in the high byte
    retcode = retcode >> 8
    out, err = '', ''

    if os.path.exists(stdout):
        out = Path(stdout).read_text()

    if os.path.exists(stderr):
        err = Path(stderr).read_text()

    array = {
        "stdout": out,
        "stderr": err,
        "retcode": retcode
    }

    return array

objective_function_mongodb.py:

import re
import uuid

import helper

def objective_function_mongodb(parameter):
    # One unique pair of log files per evaluation so parallel workers don't collide
    specific_log_file = str(uuid.uuid4())
    log_files = {
        "stderr": specific_log_file + ".stderr",
        "stdout": specific_log_file + ".stdout"
    }

    program = "programs/" + parameter[0] + "/run.sh"

    res = helper.run_program(program, log_files)
    # Extract the numeric result from the program's stdout
    m = re.search(r'RESULT: ([+-]?\d+(?:\.\d+)?)\n', res["stdout"])

    return float(m.group(1))

from hyperopt import fmin, hp, rand
from hyperopt.mongoexp import MongoTrials

import objective_function_mongodb

servername = ...
serverport = ...

projectname = "test"
space = hp.choice('a', [hp.randint("axis1", 10000)])
mongourl = 'mongo://' + str(servername) + ":" + str(serverport) + '/' + projectname + '/jobs'
trials = MongoTrials(mongourl, exp_key=projectname)

best = fmin(
    fn=objective_function_mongodb.objective_function_mongodb,
    trials=trials,
    space=[projectname, space],
    algo=rand.suggest,  # fmin expects a suggest function here, e.g. rand.suggest or tpe.suggest
    max_evals=100000,
    catch_eval_exceptions=True
)

programs/test/run.sh produces a few thousand lines of output (it in turn calls a Python script) which I need for further analysis, so just grepping its output for RESULT is not enough.

I also tried the following as run_program's code (with an additional debug log file); according to a website I can no longer find, this was supposed to solve problems when the output exceeds Popen's 64K pipe buffer:

import os
import pathlib

def print_to_log(string, logfile):
    # Make sure the directory for the log file exists
    folder = os.path.dirname(logfile)
    pathlib.Path(folder).mkdir(parents=True, exist_ok=True)

    # Append if the file already exists, otherwise create it
    if os.path.exists(logfile):
        append_write = 'a'
    else:
        append_write = 'w'

    with open(logfile, append_write) as logfilehandler:
        print(string, file=logfilehandler)

import socket
import subprocess
import time

def objective_function_mongodb(parameter):
    program = "programs/" + parameter[0] + "/run.sh"
    # logfile is the path of the debug log used by print_to_log (set elsewhere)
    programconverted = []
    print_to_log("Before Conversion", logfile)
    for i in program.split():
        programconverted.append(str(i))
        print_to_log("In Conversion", logfile)

    print_to_log("Definition of MAX_OUTPUT_SIZE", logfile)
    MAX_OUTPUT_SIZE = 2 ** 64

    print_to_log("Starting Timer", logfile)
    start = time.time()
    print_to_log("Intializing Sockets", logfile)
    stdout = socket.socketpair()
    stderr = socket.socketpair()
    # nonblocking and a timeout are not the same; a timeout is easier to handle via the socket.timeout exception
    print_to_log("Setting Timeouts", logfile)
    stdout[0].settimeout(0.01)
    stderr[0].settimeout(0.01)
    print_to_log("Begin popen", logfile)
    p = subprocess.Popen(programconverted, stdout=stdout[1], stderr=stderr[1], close_fds=True)
    print_to_log("Create empty out und err", logfile)
    out, err = "", ""
    print_to_log("Initizializing returncode", logfile)
    returncode = None
    print_to_log("Begin Loop", logfile)
    loopnr = 0

    while True:
        print_to_log("p.poll()", logfile)
        p.poll()

        print_to_log("Get stdout/stderr", logfile)
        try:
            outtmp = stdout[0].recv(4096).decode('utf-8')
        except socket.timeout as exc:
            outtmp = ""

        try:
            errtmp = stderr[0].recv(4096).decode('utf-8')
        except socket.timeout as exc:
            errtmp = ""

        print_to_log("Adding (out/err)tmp onto (out/err), loopnr: " + str(loopnr), logfile)
        out += str(outtmp)
        err += str(errtmp)

        if len(out) > MAX_OUTPUT_SIZE or \
           len(err) > MAX_OUTPUT_SIZE:
            print_to_log("Killing process because it's output is bigger than MAX_OUTPUT_SIZE", logfile)
            p.kill()
            p.wait()
            out = out[:MAX_OUTPUT_SIZE]
            err = err[:MAX_OUTPUT_SIZE]
            out += "Output was truncated to " + str(MAX_OUTPUT_SIZE)

        if p.returncode is not None:
            print_to_log("Returncode: " + str(p.returncode), logfile)
            returncode = p.returncode
            break
        time.sleep(0.1)
        loopnr = loopnr + 1
    end = time.time()
    # we now have: returncode, out, err, start, end

    # close both ends of both socket pairs
    stdout[0].close()
    stdout[1].close()
    stderr[0].close()
    stderr[1].close()

    out = str(out)
    err = str(err)

    array = {
        "stdout": out,
        "stderr": err,
        "retcode": returncode
    }

    return array

In the debug log files (written by print_to_log), only a single "Adding (out/err)tmp onto (out/err), loopnr: ..." line with a seemingly random loop number appears, and nothing else (nothing before and nothing after it). According to the logs, the program never gets killed.

The stderr files work, but the stdout files are mostly either empty or show only random lines from the output if there are more than 200 workers in parallel.

I expected stdout to be written completely, especially when using > and 2>.

ulimit -a shows me

ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 256025
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) 307200
open files                      (-n) 1048576
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4096
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

1 Answer


You are probably running up against an OS limit; with three file handles per process, 200 subprocesses require 600 open file handles. Some OSes set this limit as low as a few dozen, but you may be able to bump the limit upwards with ulimit.
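For illustration, a process can inspect and raise its own soft limit on open file descriptors from Python via the resource module (a minimal sketch; the target value of 4096 is just an arbitrary example):

import resource

# Query the current soft/hard limits on open file descriptors
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("open files: soft =", soft, "hard =", hard)

# Raise the soft limit, but never beyond the hard limit the OS allows
resource.setrlimit(resource.RLIMIT_NOFILE, (min(4096, hard), hard))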

As an aside, no offense, but your run_program is a rather crufty reimplementation of subprocess.run().
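For reference, a minimal sketch of what such a replacement could look like, assuming Python 3.7+ (for capture_output) and shell=True because program is a shell command string:

import subprocess

def run_program(program):
    # Run the shell command, wait for it, and capture both streams in memory
    proc = subprocess.run(program, shell=True, capture_output=True, text=True)
    return {
        "stdout": proc.stdout,
        "stderr": proc.stderr,
        "retcode": proc.returncode,
    }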

tripleee
  • Perhaps see also https://stackoverflow.com/questions/4256107/running-bash-commands-in-python/51950538#51950538 – tripleee Jun 09 '19 at 09:20
  • Thanks for the comment, I'm relatively new to Python so I will look up subprocess.run. I've added the ulimit -a output. Is it "open files (-n) 1048576"? if so, isn't 1048576 more than enough? – JohnnySmithensen Jun 09 '19 at 09:25
  • Thanks for the update; I'll leave this answer for now even though it's apparently not the explanation for your problem. – tripleee Jun 09 '19 at 11:12