
I am using Python 2.7 on a Windows machine. I have an array of URLs, each accompanied by data and headers, so the POST method is required. A simple sequential version works well:

    import urllib
    import urllib2

    rescodeinvalid = []
    success = []
    for i in range(0, len(HostArray)):
        data = urllib.urlencode(post_data)
        req = urllib2.Request(HostArray[i], data)
        try:
            response = urllib2.urlopen(req)
            rescode = response.getcode()
        except urllib2.HTTPError as e:
            rescode = e.code   # urlopen raises HTTPError for 4xx/5xx responses

        if rescode == 400:
            rescodeinvalid.append(HostArray[i])
        elif rescode == 200:
            success.append(HostArray[i])

My question: if HostArray is very large, this loop takes a long time. How can I check each URL of HostArray in multiple threads? If the response code of a URL is 200, I do a different operation. I have arrays to store the 200 and 400 responses. How can I do this with multithreading in Python?

imp
  • Possible duplicate of http://stackoverflow.com/questions/13481276/threading-in-python-using-queue ? And be careful not to open too many sockets at once, see http://stackoverflow.com/questions/9487569/windows-limitation-on-number-of-simultaneously-opened-sockets-connections-per-ma – nodakai Feb 09 '14 at 17:15
  • possible duplicate of [Multiple (asynchronous) connections with urllib2 or other http library?](http://stackoverflow.com/questions/4119680/multiple-asynchronous-connections-with-urllib2-or-other-http-library) – Piotr Dobrogost Feb 09 '14 at 21:07

3 Answers


scrapy uses the Twisted library to fetch multiple URLs in parallel without the overhead of opening a new thread per request. It also manages an internal queue that accumulates and even prioritizes requests, and as a bonus you can cap parallelism via the maximum-concurrent-requests setting. You can launch a scrapy spider as an external process or from your own code; just set the spider's start_urls = HostArray.
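A minimal sketch of that idea (assuming a reasonably recent Scrapy is installed; HostSpider is a made-up name, FormRequest is used because the question needs POST, and HostArray, post_data, success, and rescodeinvalid are assumed to be defined as in the question):

    import scrapy
    from scrapy.crawler import CrawlerProcess

    class HostSpider(scrapy.Spider):
        name = "hosts"
        handle_httpstatus_list = [400]   # let 400 responses reach parse()

        def start_requests(self):
            # POST to every host; HostArray and post_data come from the question
            for url in HostArray:
                yield scrapy.FormRequest(url, formdata=post_data, callback=self.parse)

        def parse(self, response):
            if response.status == 200:
                success.append(response.url)
            elif response.status == 400:
                rescodeinvalid.append(response.url)

    # CONCURRENT_REQUESTS caps how many requests run in parallel
    process = CrawlerProcess({"CONCURRENT_REQUESTS": 50})
    process.crawl(HostSpider)
    process.start()   # blocks until all requests have finished

Alternatively, save the spider to its own file and launch it as an external process with scrapy runspider.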

Guy Gavriely

If you want to do each one in a separate thread, you could do something like:

    import threading
    import urllib
    import urllib2

    rescodeinvalid = []
    success = []

    def post_and_handle(url, post_data):
        data = urllib.urlencode(post_data)
        req = urllib2.Request(url, data)
        try:
            response = urllib2.urlopen(req)
            rescode = response.getcode()
        except urllib2.HTTPError as e:
            rescode = e.code   # urlopen raises HTTPError for 4xx/5xx responses

        if rescode == 400:
            rescodeinvalid.append(url)  # append is thread safe
        elif rescode == 200:
            success.append(url)         # append is thread safe

    workers = []
    for i in range(0, len(HostArray)):
        t = threading.Thread(target=post_and_handle, args=(HostArray[i], post_data))
        t.start()
        workers.append(t)

    # Wait for all of the requests to complete
    for t in workers:
        t.join()

I'd also suggest using requests: http://docs.python-requests.org/en/latest/

as well as a thread pool: Threading pool similar to the multiprocessing Pool?

Thread pool usage:

    from multiprocessing.pool import ThreadPool

    # Done here because this must be done in the main thread
    pool = ThreadPool(processes=50)  # use a max of 50 threads

    # do this instead of Thread(target=func, args=args, kwargs=kwargs)
    pool.apply_async(func, args, kwargs)

    pool.close()  # no more tasks will be submitted
    pool.join()   # wait for the queued tasks to finish
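For this question, the pool could replace the explicit Thread loop above; a minimal sketch, reusing the post_and_handle function, HostArray, and post_data from earlier in this answer:

    from multiprocessing.pool import ThreadPool

    pool = ThreadPool(processes=50)  # at most 50 requests in flight at once

    for url in HostArray:
        # schedule post_and_handle(url, post_data) on a worker thread
        pool.apply_async(post_and_handle, (url, post_data))

    pool.close()  # no more tasks will be submitted
    pool.join()   # block until every request has been handled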
frmdstryr
  • Thanks for the answer. One question: if len(HostArray) is large, how many threads will start? Is there any limit on the number of threads Windows can start? – imp Feb 10 '14 at 06:00
  • It'll start a thread for each, so it's best to use a thread pool (see the link). – frmdstryr Feb 10 '14 at 13:15
  • I looked into from multiprocessing.pool import ThreadPool, but I am not getting how to add this to our main code. Can you please suggest? Thanks – imp Feb 12 '14 at 07:46

Your case (basically processing a list into another list) looks like an ideal candidate for concurrent.futures (see for example this answer) or you may go all the way to Executor.map. And of course use ThreadPoolExecutor to limit the number of concurrently running threads to something reasonable.
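A rough sketch of that approach (on Python 2.7 concurrent.futures is available via the futures backport, pip install futures; get_status is a made-up helper, and HostArray and post_data are assumed to be defined as in the question):

    # Python 2.7 needs the backport: pip install futures
    from concurrent.futures import ThreadPoolExecutor
    import urllib
    import urllib2

    def get_status(url):
        data = urllib.urlencode(post_data)  # post_data as in the question
        try:
            return url, urllib2.urlopen(urllib2.Request(url, data)).getcode()
        except urllib2.HTTPError as e:      # 400 responses raise HTTPError
            return url, e.code

    rescodeinvalid = []
    success = []
    with ThreadPoolExecutor(max_workers=50) as executor:   # cap at 50 threads
        for url, code in executor.map(get_status, HostArray):
            if code == 200:
                success.append(url)
            elif code == 400:
                rescodeinvalid.append(url)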

mcepl