
I have some classifiers that I want to evaluate on one sample. This task can be run in parallel, since the classifiers are independent of each other, so I want to parallelize it.

I tried it in Python and also as a bash script. The problem is that when I run the program for the first time, it takes about 30-40s to finish. When I run it multiple times consecutively, it takes just 1-3s. Even when I fed the classifiers different input I got different results, so there seems to be no caching of results. When I run some other program and afterwards rerun mine, it again takes 40s to finish.

I also observed in htop that the CPUs are not utilized much when the program runs for the first time, but when I rerun it again and again the CPUs are fully utilized.

Can someone please explain this strange behaviour to me? How can I avoid it, so that even the first run of the program is fast?

Here is the python code:

import time
import os
from fastText import load_model
from joblib import delayed, Parallel, cpu_count
import json

# reset CPU affinity: some imports (e.g. numpy) can pin the process to a single core
os.system("taskset -p 0xff %d" % os.getpid())

def format_duration(start_time, end_time):
    m, s = divmod(end_time - start_time, 60)
    h, m = divmod(m, 60)
    return "%d:%02d:%02d" % (h, m, s)

def classify(x, classifier_name, path):
    f = load_model(path + os.path.sep + classifier_name)    
    labels, probabilities = f.predict(x, 2)
    if labels[0] == '__label__True':
        return classifier_name
    else:
        return None

if __name__ == '__main__':
    with open('classifier_names.json') as json_data:
        classifiers = json.load(json_data)
    x = "input_text"

    start_time = time.time()
    Parallel(n_jobs=cpu_count(), verbose=100, backend='multiprocessing', pre_dispatch='all') \
        (delayed(classify)(x, classifier, 'clfs/')
         for classifier in classifiers)

    end_time = time.time()
    print(format_duration(start_time, end_time))

Here is the bash code:

#!/usr/bin/env bash
N=4
START_TIME=$SECONDS
open_sem(){
    mkfifo pipe-$$
    exec 3<>pipe-$$
    rm pipe-$$
    local i=$1
    for((;i>0;i--)); do
        printf %s 000 >&3
    done
}
run_with_lock(){
    local x
    read -u 3 -n 3 x && ((0==x)) || exit $x
    (
    "$@" 
    printf '%.3d' $? >&3
    )&
}
open_sem $N
for d in classifiers/* ; do
    run_with_lock ~/fastText/fasttext predict "$d" test.txt 
done
wait  # make sure all background jobs have finished before measuring the time

ELAPSED_TIME=$(($SECONDS - $START_TIME))
echo time taken $ELAPSED_TIME seconds

EDIT:

The bigger picture is that I am running a Flask app with 2 API methods. Each of them calls a function that parallelizes the classification. When I make requests, it behaves the same way as the program above: the first request to method A takes a long time, and subsequent requests take about 1s. When I switch to method B, the behavior is the same as with method A. If I alternate between method A and method B several times, like A, B, A, B, then each request takes about 40s to finish.

mark
    Probably because it has to load libraries into memory, cache the program, etc. – Willem Van Onsem Dec 19 '17 at 15:31
  • @WillemVanOnsem but it should free the memory, right? And why is similar behavior observed in the case of the bash script, where no libs are imported? The RAM is always freed after the program runs. If it is somehow caching the program, how can I pre-cache it? – mark Dec 19 '17 at 15:42
  • 5
    No: programs stay in memory unless the memory is needed. It is "free" in the sense that it can be occupied by something else, but modern OSs will remember what was originally there, since it is possible you want to reuse the same program later. See it like this: you execute program `a`, program `a` is (partially) loaded in memory, program `a` terminates, the OS marks the memory of `a` free, you execute `a` again, and each time a piece of the program is loaded the OS checks whether it is still in memory somewhere. – Willem Van Onsem Dec 19 '17 at 15:46
  • 1
    Also, if you have an instance of a program in memory and you execute another instance, the code part is shared among instances (and it does not need to be loaded again), while each instance has its own data. The "preload" you want to do is done for you by the OS when you execute for the first time. – Javier Elices Dec 19 '17 at 15:53
  • Ok, I got it. Why, then, do the cores in htop not show high utilization the first time? They do show it on consecutive calls of the program. BTW, I updated the question and added the bigger picture. – mark Dec 19 '17 at 16:08
  • The other thing is that the RAM in htop does not grow much with subsequent runs of the program. Even if the program stays in memory, shouldn't htop show that memory as occupied? – mark Dec 19 '17 at 16:30
  • Depending on your OS, but if the libraries and interpreter are loaded for the first time, they must be read from disk, which may take a long time (though 40s is longish). Subsequent reads may be served from buffers, which are kept in memory. On Linux, you will see 'wait IO' increase the first time. If that is the case, you will profit from an SSD, but only the first time. – Ljm Dullaart Dec 19 '17 at 17:48
  • Yeah, I checked it. The first time it did some reads, and on consecutive runs it did no reads at all. This article is quite handy (http://bencane.com/2012/08/06/troubleshooting-high-io-wait-in-linux/). The question is: how can I speed up parallel IO read operations? – mark Dec 20 '17 at 10:13
  • Do you see the same behaviour if you remove the `taskset` and run: `parallel --joblog - ~/fastText/fasttext predict {} test.txt ::: classifiers/*`? – Ole Tange Dec 21 '17 at 09:16
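The comments above converge on the OS page cache: the first run pays for cold disk reads of the model files, and subsequent runs hit cached pages. One way to make even the first parallel run fast is to pre-read (warm) the files sequentially before launching the workers. A minimal sketch, assuming the models match a glob pattern such as `clfs/*`:

```python
import glob
import os

def prewarm(pattern):
    """Read every file matching pattern once, discarding the data, so
    the OS page cache is already hot when the parallel workers start.
    Returns the total number of bytes read."""
    total = 0
    for path in sorted(glob.glob(pattern)):
        with open(path, "rb") as f:
            while True:
                chunk = f.read(1 << 20)  # 1 MiB chunks
                if not chunk:
                    break
                total += len(chunk)
    return total
```

Calling `prewarm('clfs/*')` once (e.g. at Flask app startup) moves the slow sequential disk reads out of the first request.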

1 Answer


One approach is to modify your Python code to use an event loop, stay running all the time, and execute new jobs in parallel whenever they are detected. One way to do this is to have a job directory and place a file in that directory whenever there is a new job to do. The Python script should also move completed jobs out of that directory to prevent running them more than once. See: How to run a function when anything changes in a dir with Python Watchdog?
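A minimal sketch of the job-directory idea, using plain polling (the linked Watchdog approach can replace the polling with filesystem events); the directory names and `handle` callback are illustrative:

```python
import os
import shutil

def drain_job_dir(todo_dir, done_dir, handle):
    """Process every job file currently in todo_dir with handle(path),
    then move it to done_dir so the same job is never run twice.
    Returns the list of job names processed; a long-running service
    would call this repeatedly from its event loop."""
    processed = []
    for name in sorted(os.listdir(todo_dir)):
        src = os.path.join(todo_dir, name)
        handle(src)
        shutil.move(src, os.path.join(done_dir, name))
        processed.append(name)
    return processed
```

Queuing a job is then just creating a file in `todo_dir`, which any other process (or the bash side) can do.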

Another option is to use a FIFO (named pipe) that is piped to the Python script, appending a new line to it for each new job. https://www.linuxjournal.com/content/using-named-pipes-fifos-bash
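On the Python side, consuming the FIFO is just line-oriented iteration; a minimal sketch (the `handle` callback stands in for whatever classification the jobspec triggers):

```python
def consume_jobs(stream, handle):
    """Dispatch one jobspec per line read from stream to handle.
    With a FIFO, open('jobs') blocks until a writer opens the other
    end, so this naturally waits for work. Blank lines are skipped.
    Returns the number of jobs handled."""
    count = 0
    for line in stream:
        jobspec = line.strip()
        if jobspec:
            handle(jobspec)
            count += 1
    return count
```

The same function works unchanged with `sys.stdin`, which is what the GNU parallel setup below relies on.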

I personally dislike parallelizing in Python and prefer to parallelize in bash using GNU parallel. To do it this way, I would:

  • implement the event loop and the jobs directory (or the FIFO job queue) using bash and GNU parallel
  • modify the Python script to remove all the parallel code, read each jobspec from stdin, and process each one serially in a loop
  • pipe jobs to parallel, which pipes them to ncpu Python processes, each of which runs forever waiting for its next job from stdin

e.g., something like:

run_jobs.sh:
mkfifo jobs
cat jobs | parallel --pipe --round-robin python classify.py

queue_jobs.sh:
echo jobspec >> jobs

classify.py:
import sys

for jobspec in sys.stdin:
    jobspec = jobspec.strip()
    ...

This has the disadvantage that all ncpu Python processes may have the slow startup problem, but since they can stay running indefinitely, that cost is paid only once, so the problem becomes insignificant, and the code is much simpler and easier to debug and maintain.
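To make that concrete, here is a minimal sketch of such a long-running worker; `load_model_once` is a hypothetical stand-in for fastText's `load_model`, which is the slow call the question identifies:

```python
import sys

def load_model_once():
    # Hypothetical stand-in for fastText's load_model(): this is the
    # slow disk read that should happen once per worker process,
    # not once per job.
    return {"loaded": True}

def worker(stream, classify_fn):
    """Long-running worker: pay the model-loading cost once, then
    classify one jobspec per line until the stream closes.
    Returns the number of jobs handled."""
    model = load_model_once()            # slow, but happens only once
    handled = 0
    for line in stream:
        jobspec = line.strip()
        if jobspec:
            classify_fn(model, jobspec)  # reuses the loaded model
            handled += 1
    return handled

if __name__ == "__main__":
    worker(sys.stdin, lambda model, job: print(job))
```

Run under `parallel --pipe --round-robin`, each of the ncpu copies of this process loads its model once and then serves jobs from stdin indefinitely.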

Using a jobs directory with a file for each jobspec, instead of a FIFO job queue, requires slightly more code, but it also makes it more straightforward to see which jobs are queued and which are done.

webb