
I am using ThreadPoolExecutor and giving the exact same task to each worker. The task is to run a jar file and do something with its output. The problem I am facing is related to timings.

Case 1: I submit one task to the pool and the worker completes in ~8 seconds.

Case 2: I submit the same task twice, and both workers complete in around ~10.50 seconds.

Case 3: I submit the same task three times, and all three workers complete in around ~13.38 seconds.

Case 4: I submit the same task four times, and all four workers complete in around ~18.88 seconds.

If I replace the workers' task with time.sleep(8) (instead of running the jar file), then all 4 workers finish at ~8 seconds. Is this because the OS has to set up a Java environment (start a JVM) before executing the Java code, and it cannot manage that for several processes in parallel?

Can someone explain why the execution time increases for the same task when run in parallel? Thanks :)

Here is how I am executing the pool:

    from concurrent import futures
    from subprocess import run, PIPE

    def transfer_files(file_name):
        raw_file_obj = s3.Object(bucket_name='foo-bucket', key=file_name)
        body = raw_file_obj.get()['Body']

        # prepare java command
        java_cmd = "java -server -ms650M -mx800M -cp {} commandline.CSVExport --sourcenode=true --event={} --mode=human_readable --configdir={}" \
            .format(jar_file_path, event_name, config_dir)

        # Run decoder_tool by piping in the encoded binary bytes
        log.info("Running java decoder tool for file {}".format(file_name))
        res = run([java_cmd], cwd=tmp_file_path, shell=True, input=body.read(), stderr=PIPE, stdout=PIPE)
        res_output = res.stderr.decode("utf-8")

        if res.returncode != 0:
            if 'Unknown event' in res_output:
                log.error("Exception occurred whilst running decoder tool")
                raise Exception("Unknown event {}".format(event_name))
        log.info("decoder tool output: \n" + res_output)

    with futures.ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        # add new task(s) into thread pool
        pool.map(transfer_files, ['fileA_for_workerA', 'fileB_for_workerB'])
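
To isolate whether JVM start-up alone is what scales badly, a rough check is to time a bare `java -version` call sequentially and then from several threads (a minimal sketch, independent of the code above; it only assumes `java` is on the PATH):

    import time
    from concurrent import futures
    from subprocess import run, DEVNULL

    def start_jvm(_):
        # starts a JVM, prints the version and exits; no jar involved
        run(["java", "-version"], stdout=DEVNULL, stderr=DEVNULL)

    for n in (1, 2, 4):
        t0 = time.time()
        with futures.ThreadPoolExecutor(max_workers=n) as pool:
            list(pool.map(start_jvm, range(n)))
        print("{} parallel JVM start-up(s): {:.2f}s".format(n, time.time() - t0))

If the per-call time grows with n here too, the slowdown would be coming from starting several JVMs at once rather than from the thread pool itself.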

1 Answer


Using multithreading doesn't necessarily mean the work will finish faster. In Python you also have to deal with the GIL, which lets only one thread execute Python bytecode at a time. Think of it like this: one person can do one task faster than one person doing two tasks at the same time. He/she would have to multitask, doing part of thread 1 first, then switching to thread 2, and so on. The more threads there are, the more switching the Python interpreter has to do.
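
A rough illustration of that switching cost (pure-Python CPU work, so the GIL serializes it; the exact numbers will vary by machine):

    import time
    from concurrent import futures

    def cpu_task(_):
        # pure-Python busy loop: holds the GIL while it runs
        total = 0
        for i in range(10_000_000):
            total += i
        return total

    for n in (1, 2, 4):
        t0 = time.time()
        with futures.ThreadPoolExecutor(max_workers=n) as pool:
            list(pool.map(cpu_task, range(n)))
        print("{} thread(s): {:.2f}s".format(n, time.time() - t0))

With 4 threads the total time is roughly 4x the single-thread time, because the threads take turns holding the GIL instead of running simultaneously.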

The same kind of thing might be happening on the Java side too. I don't use Java, but according to Is Java a Compiled or an Interpreted programming language?, the JVM compiles/interprets bytecode on the fly, so each java process probably has similar start-up and warm-up work to do.

And as for time.sleep(8): sleeping just blocks the thread without doing any CPU work (and it releases the GIL while waiting), so it is easy to run a whole bunch of waiting threads side by side.
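
You can see that directly (a quick sketch, separate from the question's code):

    import time
    from concurrent import futures

    t0 = time.time()
    with futures.ThreadPoolExecutor(max_workers=4) as pool:
        # four threads sleeping 8 seconds each still finish in ~8 seconds total
        list(pool.map(time.sleep, [8, 8, 8, 8]))
    print("elapsed: {:.2f}s".format(time.time() - t0))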
