
I am trying to download images from a list of URLs using Python. To make the process faster, I used the multiprocessing library.

The problem I am facing is that the script often hangs/freezes on its own, and I don't know why.

Here is the code that I am using:

...
import multiprocessing as mp
import urllib.request

def getImages(val):

    # Download images
    try:
        url = ...    # preprocess the URL from the input val
        local = ...  # filename generation from global variables and random stuff...
        urllib.request.urlretrieve(url, local)
        print("DONE - " + url)
        return 1
    except Exception as e:
        print("CAN'T DOWNLOAD - " + url)
        return 0

if __name__ == '__main__':

    files = "urls.txt"
    lst = list(open(files))
    lst = [l.replace("\n", "") for l in lst]

    pool = mp.Pool(processes=4)
    res = pool.map(getImages, lst)

    print ("tempw")

It often gets stuck halfway through the list: it prints DONE or CAN'T DOWNLOAD for roughly half of the list it has processed, but I don't know what is happening with the rest. Has anyone faced this problem? I have searched for similar problems (e.g. this link) but found no answer.

Thanks in advance

user2552108

2 Answers


OK, I have found an answer.

The likely culprit was that the script got stuck while connecting to or downloading from a URL, so I added a socket timeout to limit the time spent connecting and downloading each image.

The issue no longer occurs.

Here is my complete code:

...
import multiprocessing as mp
import urllib.request

import socket

# Set the default timeout in seconds
timeout = 20
socket.setdefaulttimeout(timeout)

def getImages(val):

    # Download images
    try:
        url = ...    # preprocess the URL from the input val
        local = ...  # filename generation from global variables and random stuff...
        urllib.request.urlretrieve(url, local)
        print("DONE - " + url)
        return 1
    except Exception as e:
        print("CAN'T DOWNLOAD - " + url)
        return 0

if __name__ == '__main__':

    files = "urls.txt"
    lst = list(open(files))
    lst = [l.replace("\n", "") for l in lst]

    pool = mp.Pool(processes=4)
    res = pool.map(getImages, lst)

    print ("tempw")

Hope this solution helps others who are facing the same issue
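
If you prefer a per-request timeout instead of a process-wide default, urllib.request.urlopen also accepts a timeout argument. Here is a minimal sketch (the download_one helper name and the 20-second value are just illustrative, not part of my script above):

import urllib.request

def download_one(url, local, timeout=20):
    # Open the URL with an explicit per-request timeout (in seconds) instead of
    # relying on socket.setdefaulttimeout(); a timeout raises an exception that
    # the caller can catch, just like in getImages above.
    with urllib.request.urlopen(url, timeout=timeout) as resp, open(local, "wb") as out:
        out.write(resp.read())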

user2552108

It looks like you're facing a GIL issue: the Python Global Interpreter Lock basically forbids Python from doing more than one task at the same time. The multiprocessing module really launches separate instances of Python to get the work done in parallel.

But in your case, urllib is called in all these instances: each of them tries to lock the I/O process; the one that succeeds (i.e. comes first) gets you the result, while the others (trying to lock an already locked process) fail.

This is a very simplified explanation, but here are some additional resources:

You can find another way to parallelize requests here : Multiprocessing useless with urllib2?

And more info about the GIL here : What is a global interpreter lock (GIL)?
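
For example, since downloading is I/O-bound, a simple thread pool is often enough; here is a minimal sketch, assuming the same getImages function and urls.txt file from the question:

from concurrent.futures import ThreadPoolExecutor

if __name__ == '__main__':
    # getImages is assumed to be the function from the question.
    with open("urls.txt") as f:
        lst = [line.strip() for line in f]

    # Threads release the GIL while blocked on network I/O, so a thread pool
    # parallelizes the downloads without spawning separate processes.
    with ThreadPoolExecutor(max_workers=4) as executor:
        res = list(executor.map(getImages, lst))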

CoMartel
Thanks for the response. I am not sure the GIL was the culprit, but it is useful to learn that the GIL is one of the possible causes. – user2552108 Mar 02 '18 at 04:11