
I'm trying to write a simple multi-threaded Python script:

import glob
import itertools
import os
from shutil import copyfile
from multiprocessing.dummy import Pool as ThreadPool

from skimage import io, transform

# img_file_extension and copyLabels are assumed to be defined elsewhere in the script

def resize_img_folder_multithreaded(img_fldr_src,img_fldr_dst,max_num_of_thread):

    images = glob.glob(img_fldr_src+'/*.'+img_file_extension)
    pool = ThreadPool(max_num_of_thread) 

    pool.starmap(resize_img,zip(images,itertools.repeat(img_fldr_dst)))
    # close the pool and wait for the work to finish 
    pool.close() 
    pool.join() 


def resize_img(img_path_src,img_fldr_dest):
    #print("about to resize image=",img_path_src)
    image = io.imread(img_path_src)         
    image = transform.resize(image, [300,300])
    io.imsave(os.path.join(img_fldr_dest,os.path.basename(img_path_src)),image)      
    label = img_path_src[:-4] + '.xml'
    if copyLabels is True and os.path.exists(label) is True :
        copyfile(label,os.path.join(img_fldr_dest,os.path.basename(label)))

Setting the argument max_num_of_thread to any number in [1...10] doesn't improve my run time at all (for 60 images it stays around 30 sec), and with max_num_of_thread=10 my PC got stuck.

My question is: what is the bottleneck in my code, and why can't I see any improvement?

Some data about my PC:

python -V
Python 3.6.4 :: Anaconda, Inc.


cat /proc/cpuinfo | grep 'processor' | wc -l
4

cat /proc/meminfo 
MemTotal:        8075960 kB
MemFree:         3943796 kB
MemAvailable:    4560308 kB

cat /etc/*release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=17.10
LordTitiKaka

3 Answers

2

Blame the GIL.

Python has a mechanism called the GIL, the Global Interpreter Lock. It is basically a mutex that prevents multiple native threads from executing Python bytecode at the same time. This is necessary because CPython's memory management is not thread-safe.

In other words, the GIL will prevent you from running multiple threads at the same time. Essentially, you're running one thread at a time. Multi-threading, in the sense of exploiting multiple CPU cores, is more like an illusion in Python.
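
For instance, here is a minimal sketch (the cpu_bound function and the numbers are purely illustrative) showing that a pure-Python, CPU-bound task gains nothing from a thread pool:

import time
from multiprocessing.dummy import Pool as ThreadPool

def cpu_bound(n):
    # pure Python arithmetic: the thread holds the GIL for the whole loop
    return sum(i * i for i in range(n))

if __name__ == '__main__':
    start = time.time()
    for _ in range(4):
        cpu_bound(2000000)
    print('sequential:', time.time() - start)

    start = time.time()
    with ThreadPool(4) as pool:
        pool.map(cpu_bound, [2000000] * 4)
    print('4 threads: ', time.time() - start)  # roughly the same, because of the GIL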

Fortunately, there is a way to solve this problem, though it's a bit more expensive resource-wise: you can use multiprocessing instead. Python has excellent support for this through the multiprocessing module. This way, you will be able to achieve parallelism[1].

You might ask why multiprocessing isn't affected by the GIL's limitations. The answer is pretty simple: each new process of your program gets its own instance of the Python interpreter, which means that each process has its own GIL. So the processes are not serialized by a shared GIL, but scheduled by the OS itself. This provides you with parallelism[2].
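
Applied to the code in the question, a minimal sketch of that switch could look like this (the function name here is illustrative; only the import and the pool class change, while resize_img and img_file_extension stay exactly as in the question):

import glob
import itertools
from multiprocessing import Pool  # process pool instead of multiprocessing.dummy's thread pool

def resize_img_folder_multiprocess(img_fldr_src, img_fldr_dst, max_num_of_processes):
    images = glob.glob(img_fldr_src + '/*.' + img_file_extension)
    # every worker process runs its own interpreter with its own GIL
    with Pool(max_num_of_processes) as pool:
        pool.starmap(resize_img, zip(images, itertools.repeat(img_fldr_dst)))

Note that with processes the target function and its arguments must be picklable and defined at module level, which is already the case for resize_img in the question.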


References

Sean Francis N. Ballais
  • Re, "Multi-threading is more like an illusion in Python." Exploiting multiple CPUs is only one of several different reasons why people write multi-threaded programs. It's not even the _original_ reason, because threads have been around since at least a decade before multi-CPU computers became a thing you could buy. – Solomon Slow May 11 '18 at 14:48
  • @jameslarge, did I exaggerate too much on the "multi-threading is more like an illusion" part? Or should I have made clearer? I might have got some things wrong. Reading from your comment, I should do the proper edits. – Sean Francis N. Ballais May 11 '18 at 14:52
  • Also, interesting. Was the original reason more on allowing responsiveness? Or something else? – Sean Francis N. Ballais May 11 '18 at 14:52
  • I'm not a historian, but I first encountered threads when writing code for tiny, embedded systems. My very first-ever embedded project had five threads, each of which was responsible to wait for, and respond to a different external event. So, async I/O is one reason. I believe that people have also used threads as a substitute for co-routines, where different, independent state machines are running in different threads. – Solomon Slow May 11 '18 at 15:01
  • @jameslarge, cool! I'm still trying to wrap my head around co-routines. If I am getting things right, in Python, multi-threading has the same behaviour as co-routines. Am I correct? – Sean Francis N. Ballais May 11 '18 at 15:09
  • Coroutines are weird. With coroutines, you see an expression, `f(x)` in one coroutine that looks like a function call, but actually it causes a context switch, and the value `x` becomes the _return_ value from an expression, `g(x)` that looked like a function call in the other coroutine. Eventually the other co-routine does something similar to pass control and values back the other way. Python _generators_ are like co-routines, but less powerful. Python threads are more powerful. You can use threads to mimic coroutines, and you could use coroutines (if you had them) to mimic generators. – Solomon Slow May 11 '18 at 15:49
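
To make the generator comparison in the last comment concrete, a tiny hypothetical sketch of passing values back and forth with a generator's send():

def echo():
    received = None
    while True:
        # the value passed to send() becomes the result of this yield expression
        received = yield 'got: {}'.format(received)

coro = echo()
next(coro)                 # advance to the first yield ("prime" the generator)
print(coro.send('hello'))  # got: hello
print(coro.send('world'))  # got: world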
2

The problem comes from the Global Interpreter Lock, or GIL. The GIL only lets one thread run at a time, so if you want to do parallel computation, use multiprocessing.Pool:

import multiprocessing

pool = multiprocessing.Pool(max_num_of_process)  # use the number of CPU cores as the max number of processes

!!! multiprocessing.dummy is a wrapper around the threading module; it lets you interact with a thread pool exactly as you would with a process pool.
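
For example (a minimal sketch with a placeholder work function), the two pools expose the same interface, so switching between threads and processes is just a change of import:

from multiprocessing import Pool            # processes: real parallelism, no shared GIL
# from multiprocessing.dummy import Pool    # threads: identical API, but limited by the GIL

def work(x):
    return x * x

if __name__ == '__main__':
    with Pool(4) as pool:
        print(pool.map(work, range(10)))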

Yassine Faris
0

You should only use multiprocessing with the number of CPU cores you have available. You are also not using a Queue, so the pool of resources is doing the same work. You need to add a queue to your code.

Filling a queue and managing multiprocessing in python
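
A minimal sketch of that pattern, with illustrative names: worker processes pull tasks from a multiprocessing.Queue and push results onto another one:

import multiprocessing

def worker(task_queue, result_queue):
    # keep pulling tasks until the sentinel None is received
    for task in iter(task_queue.get, None):
        result_queue.put(task * task)  # placeholder for real work, e.g. resizing one image

if __name__ == '__main__':
    task_queue = multiprocessing.Queue()
    result_queue = multiprocessing.Queue()

    workers = [multiprocessing.Process(target=worker, args=(task_queue, result_queue))
               for _ in range(multiprocessing.cpu_count())]
    for w in workers:
        w.start()

    tasks = list(range(20))
    for task in tasks:
        task_queue.put(task)
    for _ in workers:
        task_queue.put(None)   # one sentinel per worker to signal shutdown

    results = [result_queue.get() for _ in range(len(tasks))]  # drain results before joining
    for w in workers:
        w.join()
    print(results)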

eatmeimadanish
  • It's actually possible to create massive numbers of parallel processes, more than the number of cores by 1000's if you want. I've done this when each process was responsible for web calls that took an arbitrary amount of time. Each process then slept for a minute and restarted. It's easy to let the OS do the task switching, and the real limit on number of processes is usually memory or ulimit stuff (which you can fix). – Kevin J. Rice May 11 '18 at 15:00