
In my Python code I call an external API to get a list of image URLs. For each of these URLs I create a thread that generates a thumbnail. Here is the relevant part of the code:

from threading import Thread
from queue import Queue  # on Python 2: from Queue import Queue
# get_thumbnail comes from an external thumbnail library (see the end of the post)

def process_image(image, size, cropping, counter, queue):
    # Generate a thumbnail for one image URL and hand it back via the queue.
    options = dict(crop=cropping)
    img = get_thumbnail(image['url'], size, **options)
    queue.put((counter, img))
    return img

...

queue = Queue()

# Get some information about an artist. Images are also included.
artist = get_profile(artist_id, buckets)

# Generate images' thumbnails
threads = [Thread(target=process_image, args=(img, '500', 'center', counter, queue)) for counter, img in enumerate(artist.data['images'])]

for p in threads:
    p.start()
for p in threads:
    p.join()

imgs = []
# Collect processed images from threads
while not queue.empty():
    el = queue.get()
    imgs.append((el[0], el[1]))

My problem is that some of the URLs don't work: if I copy-paste one of them into the browser, it keeps loading and loading until a timeout is returned. Obviously, I added multithreading to speed things up. The first URL that causes this problem is the 4th one, so if I add...

# Generate images' thumbnails
threads = [Thread(target=process_image, args=(img, '500', 'center', counter, queue)) for counter, img in enumerate(artist.data['images'])]
threads = threads[:3]

everything works as expected and very quickly; otherwise execution blocks for a long time before it finally terminates. I would like to set some kind of timeout (say, 1 second) for the thread to run the function: if the URL does not respond and the thread does not finish before the timeout, that thread should exit.

Thank you in advance for your help.

pypy
  • Seems like you should have the timeout as part of the URL request, rather than making the thread responsible for killing itself if it's taking too long. Is there a way to set a timeout in the library you're using? – turbulencetoo Oct 15 '15 at 22:18
  • I'd also point out that multithreading in Python often won't speed things up, because only one thread can be executed by the interpreter at any moment. See https://wiki.python.org/moin/GlobalInterpreterLock – turbulencetoo Oct 15 '15 at 22:19
  • @turbulencetoo has a point regarding letting the URL request time out. I'm more familiar with multiprocessing than threading, but doesn't join() take an optional timeout parameter? Also, multiprocessing will let you run truly in parallel if you have more than one CPU. – RobertB Oct 15 '15 at 22:19

3 Answers


If the get_thumbnail function is yours, I'd build a timeout into it, as suggested by @turbulencetoo. Otherwise, take a look at the signal module to add a timeout to process_image. As suggested in the comments, you may also see further benefit from using multiprocessing instead of threading. The interface of the multiprocessing module is almost identical to that of threading, so it shouldn't be much work to switch.
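A minimal sketch of the signal approach (an assumption about how it would be wired in, not the asker's actual code): signal.alarm is Unix-only, and Python delivers SIGALRM only to the main thread, so this variant processes images sequentially in the main thread rather than in workers.

import signal

def _alarm_handler(signum, frame):
    # TimeoutError is built in on Python 3; define your own Exception subclass on Python 2.
    raise TimeoutError("thumbnail generation timed out")

def process_image_safe(image, size, cropping, counter, queue, timeout=1):
    # Must run in the main thread: Python delivers SIGALRM only there.
    signal.signal(signal.SIGALRM, _alarm_handler)
    signal.alarm(timeout)  # raises TimeoutError after `timeout` seconds
    try:
        img = get_thumbnail(image['url'], size, crop=cropping)
        queue.put((counter, img))
        return img
    except TimeoutError:
        return None  # skip URLs that hang
    finally:
        signal.alarm(0)  # cancel the pending alarm

With multiprocessing, a timeout comes almost for free: pool.apply_async(process_image, args).get(timeout=1) raises multiprocessing.TimeoutError if the worker takes too long.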

John Greenall

As described in other questions, there is no official way to kill a thread in Python. In cases where the thread is doing work that you control (rather than blocking, e.g., on a network request), you can use a shared flag to have the threads stop themselves, but that doesn't seem to be the case here.
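For completeness, a sketch of that cooperative pattern, assuming a hypothetical do_chunk() that stands in for work the thread controls:

import threading

stop_requested = threading.Event()

def worker():
    # The thread checks the flag between units of work and exits on its own;
    # this helps only when the work is divisible, not when a single call blocks.
    while not stop_requested.is_set():
        do_chunk()  # hypothetical: one bounded piece of work

t = threading.Thread(target=worker)
t.start()
# ... later, from the main thread:
stop_requested.set()  # ask the worker to stop at its next check
t.join()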

For downloading multiple resources in parallel, you are probably going to want to use a library like pycurl, which uses OS-specific features to let multiple requests execute asynchronously on a single thread. It also exposes per-transfer timeout options that provide a fairly clean way to deal with the issue you describe.
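A sketch of one request with hard timeouts in pycurl; the option names (CONNECTTIMEOUT, TIMEOUT) are libcurl's standard ones, and the parallel variant would use pycurl.CurlMulti:

import pycurl
from io import BytesIO

def fetch(url, timeout_s=1):
    buf = BytesIO()
    c = pycurl.Curl()
    c.setopt(c.URL, url)
    c.setopt(c.WRITEFUNCTION, buf.write)
    c.setopt(c.CONNECTTIMEOUT, timeout_s)  # max time to establish the connection
    c.setopt(c.TIMEOUT, timeout_s)         # max time for the whole transfer
    try:
        c.perform()
        return buf.getvalue()
    except pycurl.error:
        return None  # timed out (or failed in some other way)
    finally:
        c.close()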

Myk Willis

I've finally found a solution based on @turbulencetoo's comment.

get_thumbnail is not part of my code but comes from an external library, so I couldn't add a timeout to the function myself. I thought the library had no configuration option for a timeout on the URL request, but apparently there is one (I had read about it before and misunderstood it).

@RobertB Yes, join() takes a timeout argument, and I had already tried setting it, but it didn't work: join(timeout) merely stops waiting after the timeout; it doesn't kill the thread, and a non-daemon thread still keeps the process alive until it finishes.
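As a general fallback for cases like this (an assumption about the library's internals, not its documented setting): if a library fetches URLs through Python's standard networking stack without setting its own timeout, a process-wide default applies to every socket created afterwards.

import socket

# Sockets created without an explicit timeout now use this default, so
# stdlib-based HTTP fetches raise socket.timeout instead of hanging forever.
socket.setdefaulttimeout(1.0)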

pypy