
I have a strange situation that I cannot figure out after a lot of trial and error. I am using multi-threading (10 threads) to read URLs (100 of them), and it works fine in most cases, but in some situations it gets stuck at the last thread. I waited to see if it would return, and it took a long time (1050 seconds), whereas the other nine threads returned within 25 seconds. This suggests something is wrong with my code, but I can't figure out what. Any ideas?

Note1: It happens for both daemon and non-daemon threads.

Note2: The number of URLs and threads varies. I tried different numbers of URLs from 10-100 and thread counts from 5-50.

Note3: The URLs are completely different most of the time.

import urllib2
import Queue
import threading
from goose import Goose

input_queue = Queue.Queue()
result_queue = Queue.Queue()

Thread Worker:

def worker(input_queue, result_queue):
    queue_full = True
    while queue_full:
        try:
            url = input_queue.get(False)
            # Fetch the page with urllib2, then extract the article with Goose
            html = urllib2.urlopen(url).read()
            article = Goose().extract(raw_html=html)
            result_queue.put((url, article.cleaned_text))
        except Queue.Empty:
            queue_full = False

Main process:

for url in urls:
    input_queue.put(url)

thread_count = 5
for i in range(thread_count):
    t = threading.Thread(target=worker, args=(input_queue, result_queue))
    t.start()

for url in urls:
    url = result_queue.get()  # updates url; blocks until a result is available

The process gets blocked at the last result_queue.get() call.

NOTE: I am more interested in what I am doing wrong here, in case someone can point it out. I tend to think that I wrote correct code, but apparently that's not the case.
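
One failure mode consistent with this symptom: if the fetch/process step raises an exception, the worker thread dies without ever calling result_queue.put(), and the main loop then blocks forever waiting for a result that will never arrive. A minimal defensive sketch of the worker (the safe_worker name and the timeout value are assumptions for illustration, not the original code):

def safe_worker(input_queue, result_queue):
    while True:
        try:
            url = input_queue.get(False)
        except Queue.Empty:
            break  # no more work
        try:
            # a timeout keeps one slow host from stalling the thread indefinitely
            html = urllib2.urlopen(url, timeout=30).read()
            result_queue.put((url, html))
        except Exception as e:
            # put the failure on the queue too, so the number of results
            # always matches the number of urls
            result_queue.put((url, e))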

utengr
  • Is it always the same url that causes your app to freeze? If so, have you tried to reach it in your browser? – Right leg Aug 22 '17 at 15:50
  • @right leg Nope, I tried different urls. It happens with various urls but always at the last queue.get call. – utengr Aug 22 '17 at 15:51
  • How many urls do you request? – Right leg Aug 22 '17 at 15:51
  • Between 10 and 100; I try different numbers. For some it works fine, for others it gets stuck at the last call. – utengr Aug 22 '17 at 15:53
  • My first guess would be that the host refuses further connections past a certain point, but I would need more tests to determine if that's true... Anyway, what you could try is cancelling a task when the delay is too long, and trying it again. – Right leg Aug 22 '17 at 15:55
  • What does the `q` stand for? – stamaimer Aug 22 '17 at 15:56

2 Answers


For example, I take the URLs to be a list of numbers:

import urllib2
import Queue
import threading
#from goose import Goose

input_queue = Queue.Queue()
result_queue = Queue.Queue()

def worker(input_queue, result_queue):
    while not input_queue.empty():
        try:
            url = input_queue.get(False)
            updated_value = int(url) * 9
            result_queue.put(updated_value)
        except Queue.Empty:
            pass

urls = [1,2,3,4,5,6,7,8,9]

for url in urls:
    input_queue.put(url)

thread_count = 5

for i in range(thread_count):
    t = threading.Thread(target=worker, args=(input_queue, result_queue))
    t.start()
    t.join()

for url in urls:
    try:
        url = result_queue.get() 
        print url
    except Queue.Empty:
        pass

Output

9
18 
27
36
45
54
63
72
81
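
The results print in submission order here because the join inside the loop lets the first thread drain the whole queue before the second thread even starts; the remaining threads find the queue empty and exit immediately. A sketch of the usual concurrent pattern, reusing the worker and queues above: start every thread first, then join them all.

threads = []
for i in range(thread_count):
    t = threading.Thread(target=worker, args=(input_queue, result_queue))
    t.start()
    threads.append(t)

# join only after all threads have been started, so they run concurrently
for t in threads:
    t.join()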
Kallz
  • Using join on each thread makes the process very slow, so it's not feasible for my use case (reading URLs). However, if I don't use it, then I am stuck at the last thread. – utengr Aug 24 '17 at 14:50

You can use ThreadPoolExecutor from concurrent.futures.

from concurrent.futures import ThreadPoolExecutor
import requests  # this worker fetches with requests rather than urllib2

MAX_WORKERS = 50

def worker(url):
    response = requests.get(url)
    return response.content

# urls is assumed to be the same list of URL strings as in the question
with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
    results = executor.map(worker, urls)

for result in results:
    print(result)
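
Note that executor.map yields results in the order the URLs were submitted, and an exception raised inside worker is re-raised when the corresponding result is consumed, so a failed download surfaces as an error rather than a silent hang.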
stamaimer
  • Is the use of concurrent.futures any better than the normal multi-threading option, especially in terms of performance? I am aware that it makes the interface easier to use. – utengr Aug 23 '17 at 11:30
  • @sheldr You can check [this question](https://stackoverflow.com/questions/20776189/concurrent-futures-vs-multiprocessing-in-python-3). – stamaimer Aug 23 '17 at 12:34