
I am trying to do the following.

  1. I have 8 cores.

  2. I execute 8 processes as follows, where core_aa is the name of a file whose URLs are loaded into a queue:

    python threaded_crawl.py core_aa --max_async_count=20 --use_headers --verbose > /tmp/core_aa.out
    python threaded_crawl.py core_ab --max_async_count=20 --use_headers --verbose > /tmp/core_ab.out
    python threaded_crawl.py core_ac --max_async_count=20 --use_headers --verbose > /tmp/core_ac.out
    python threaded_crawl.py core_ad --max_async_count=20 --use_headers --verbose > /tmp/core_ad.out
    python threaded_crawl.py core_ae --max_async_count=20 --use_headers --verbose > /tmp/core_ae.out
    python threaded_crawl.py core_af --max_async_count=20 --use_headers --verbose > /tmp/core_af.out
    python threaded_crawl.py core_ag --max_async_count=20 --use_headers --verbose > /tmp/core_ag.out
    python threaded_crawl.py core_ah --max_async_count=20 --use_headers --verbose > /tmp/core_ah.out
    
  3. Each of the processes is a threaded app that runs 20 threads whose job is to fetch a URL. If I have e.g. 60K URLs and I run one process, the job completes with all threads living until the queue is empty.

  4. If I run more than one process, I notice that the threads start to die slowly, e.g. one death per 1000. The idea is to split the 60K URLs from one process across 8, so the total number of threads is 20*8.

  5. The processes share no data.

So given that the job works with one process, why would executing multiple processes kill threads?

How can I fix this?

    import threading
    import time

    class ThreadClass(threading.Thread):
        def __init__(self, parms={}, proxy_list=[], user_agent_list=[], use_cookies=True, fn=None, verbose=False):
            threading.Thread.__init__(self)

        def run(self):
            # keep working while the queue reports remaining items
            while page_queue.qsize() > 0:
                FETCH URLS....


    for page in xrange(THREAD_LIMIT):
        tc = ThreadClass(parms=parms, proxy_list=proxy_list, user_agent_list=user_agent_list, use_cookies=use_cookies, fn=fn, verbose=verbose)
        tc.start()
        # throttle: do not start more threads than THREAD_LIMIT
        while threading.activeCount() >= THREAD_LIMIT:
            time.sleep(1)

    # wait until only the main thread is left
    while threading.activeCount() > 1:
        time.sleep(1)

I have no idea how to debug this, and there is no error. Given that I have the following condition,

    while threading.activeCount() > 1:
        time.sleep(1)

Once the threads are all dead, the code continues even though there are items left in the queue when the threads should run until the queue is empty.

Very confused.

Once the active count

Dan D.
Tampa
  • PS: for each thread I am using opener = urllib2.build_opener(). There is chatter about file descriptors, cf. http://stackoverflow.com/questions/9308166/in-python-when-threads-die. Could this be the cause? I am not opening files, but I am using urllib2 to fetch web pages. – Tampa Feb 23 '12 at 20:15
  • PPS: when I add the following code to start a new thread whenever the thread count falls below the threshold, it works --> while threading.activeCount() > 1: if threading.activeCount() < THREAD_LIMIT: ntc = THREAD_LIMIT - threading.activeCount() for i in xrange(ntc): tc = ThreadClass(parms=parms,proxy_list=proxy_list,user_agent_list=user_agent_list,use_cookies=use_cookies,fn=fn,verbose=verbose) tc.start() print "started a new thread" time.sleep(1) – Tampa Feb 23 '12 at 21:40
  • If you want, you can edit your question and add more information; it will be clearer than in a comment. – Rik Poggi Feb 24 '12 at 01:23

1 Answer


.qsize() returns only an approximate size, so don't use page_queue.qsize() > 0 to check whether the queue is empty: another thread can take the last item between the check and the get. Instead, loop with while True: .. page_queue.get() .. and use a sentinel value to know when you are done, or use the queue.task_done() / queue.join() combination.
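A minimal sketch of the sentinel approach, written for Python 3 (the module is called Queue in Python 2). The names SENTINEL, fetch, worker and crawl are made up for illustration, and fetch is only a stand-in for the real download code:

```python
import queue      # named "Queue" in Python 2
import threading

SENTINEL = None   # hypothetical marker meaning "no more work"

def fetch(url):
    # Placeholder for the real download logic (assumption, no network access).
    return "fetched " + url

def worker(page_queue, results):
    while True:
        url = page_queue.get()      # blocks until an item is available
        if url is SENTINEL:
            break                   # explicit shutdown, no qsize() polling
        results.append(fetch(url))  # list.append is thread-safe in CPython

def crawl(urls, thread_count=4):
    page_queue = queue.Queue()
    results = []
    threads = [threading.Thread(target=worker, args=(page_queue, results))
               for _ in range(thread_count)]
    for t in threads:
        t.start()
    for url in urls:
        page_queue.put(url)
    for _ in threads:
        page_queue.put(SENTINEL)    # one sentinel per worker
    for t in threads:
        t.join()                    # returns once every worker saw its sentinel
    return results
```

Each worker exits only when it pulls its own sentinel, so no thread can stop early because qsize() momentarily looked like zero.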

Catch exceptions in .run() method to avoid killing a thread prematurely.
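Something along these lines would keep a thread alive when a single fetch fails, and would also make the failure visible. This is only a sketch in Python 3: SafeWorker, flaky_fetch and run_workers are hypothetical names, and the simulated failure stands in for whatever urllib2 might raise in the real crawler:

```python
import queue          # named "Queue" in Python 2
import threading
import traceback

class SafeWorker(threading.Thread):
    def __init__(self, page_queue, fetch, errors):
        threading.Thread.__init__(self)
        self.page_queue = page_queue
        self.fetch = fetch
        self.errors = errors

    def run(self):
        while True:
            url = self.page_queue.get()
            if url is None:              # sentinel: no more work
                break
            try:
                self.fetch(url)
            except Exception as exc:
                # Record and report the error, then carry on with the next URL
                # instead of letting the exception end the thread.
                self.errors.append((url, exc))
                traceback.print_exc()

def flaky_fetch(url):
    # Hypothetical fetch that fails for some URLs, to simulate network errors.
    if url.endswith("7"):
        raise IOError("simulated failure for " + url)

def run_workers(urls, thread_count=2):
    page_queue = queue.Queue()
    errors = []
    workers = [SafeWorker(page_queue, flaky_fetch, errors)
               for _ in range(thread_count)]
    for w in workers:
        w.start()
    for url in urls:
        page_queue.put(url)
    for _ in workers:
        page_queue.put(None)             # one sentinel per worker
    for w in workers:
        w.join()
    return errors, page_queue.qsize()
```

Without the try/except, the first exception would end run() and the thread would disappear, which would look like threads slowly dying as failures accumulate.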

Don't use .activeCount(). If you need n threads, just create n threads.

Make your threads daemonic to be able to interrupt your program at any moment.
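Daemon threads combine naturally with the task_done() / join() pattern: the workers loop forever, the main thread waits on the queue rather than on the threads, and the process can still exit (or be interrupted) at any time. A rough Python 3 sketch, where fetch, worker and crawl are illustrative names and fetch is a stand-in for the real request:

```python
import queue      # named "Queue" in Python 2
import threading

def fetch(url):
    # Placeholder for the real download logic (assumption).
    return len(url)

def worker(page_queue, results):
    while True:
        url = page_queue.get()
        try:
            results.append((url, fetch(url)))
        except Exception as exc:
            results.append((url, exc))   # keep the thread alive on errors
        finally:
            page_queue.task_done()       # always mark the item as handled

def crawl(urls, thread_count=4):
    page_queue = queue.Queue()
    results = []
    for _ in range(thread_count):        # create exactly n threads, once
        t = threading.Thread(target=worker, args=(page_queue, results))
        t.daemon = True                  # do not block interpreter exit
        t.start()
    for url in urls:
        page_queue.put(url)
    page_queue.join()                    # blocks until every item is task_done()
    return results
```

Here queue.join() replaces the activeCount() polling loop: it returns only after every queued item has been processed, regardless of how many threads are still running.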

You don't need multiple processes if your program is IO bound. Otherwise, you could use the multiprocessing module to manage multiple processes instead of launching them by hand.
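For the IO-bound case, one process with a thread pool might be enough. The sketch below uses multiprocessing.pool.ThreadPool, which is my choice for illustration and is not named in the answer; it exposes the same interface as multiprocessing.Pool, so switching to real processes for CPU-bound work would mostly mean swapping the class. fetch and crawl are hypothetical names and fetch does no real network access:

```python
from multiprocessing.pool import ThreadPool

def fetch(url):
    # Placeholder for the real download logic (assumption).
    return (url, len(url))

def crawl(urls, thread_count=20):
    pool = ThreadPool(thread_count)      # thread_count worker threads
    try:
        # imap_unordered yields results as soon as each fetch finishes
        return list(pool.imap_unordered(fetch, urls))
    finally:
        pool.close()                     # no more tasks will be submitted
        pool.join()                      # wait for the worker threads to exit
```

With this shape, all 60K URLs could go through a single crawl() call, and the pool takes care of handing items to threads.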

Community
jfs