
I have a list of links, and I want to use 5 threads to scrape all of them. I managed to start one thread per link, but I will have over 1000 links, and I don't want to overload my computer or the website.

My question is: how do I use a fixed number of threads to scrape, for example, 100 links?

Here is what I have right now:

    import threading
    import urllib2

    from bs4 import BeautifulSoup  # or: from BeautifulSoup import BeautifulSoup (v3)

    def main():
        urls = ["http://google.com", "http://yahoo.com"]
        threads = []

        # Start one request thread per URL
        for url in urls:
            thread = threading.Thread(target=loadurl, args=(url,))
            thread.start()
            print "[+] Thread started for:", url
            threads.append(thread)

        # All requests started; wait for them all to finish
        print "[+] Requests done"
        for thread in threads:
            thread.join()
        print "[+] Finished!"

    # Just print the source
    def loadurl(url):
        page = urllib2.urlopen(url)
        soup = BeautifulSoup(page)
        print soup

– icebox19
  • Can you please fix the indentation of your post? Indentation is really important in Python, and your code, as posted, won't run due to inconsistencies. – Martijn Pieters Jan 11 '14 at 18:15
  • I suggest you use a `ThreadPool` as described in [one](http://stackoverflow.com/a/18283388/355230) of the answers to a similar question titled [_Execute multiple threads concurrently_](http://stackoverflow.com/questions/18281434/execute-multiple-threads-concurrently) (see the first sketch below). – martineau Jan 11 '14 at 19:07
  • In Python 3, use `concurrent.futures.ThreadPoolExecutor`; in Python 2, write each thread as "an infinite loop" and have each pull work descriptions off of a shared `Queue.Queue` (see the second sketch below). – Tim Peters Jan 11 '14 at 22:52
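
A minimal sketch of the `ThreadPool` approach martineau links to, assuming Python 2.7 with BeautifulSoup installed as `bs4` (both assumptions; the linked answer is the authoritative version). The pool size of 5 and the example URLs come from the question:

    from multiprocessing.pool import ThreadPool

    import urllib2
    from bs4 import BeautifulSoup  # assumes BeautifulSoup 4 is installed

    # Same scrape-and-print helper as in the question
    def loadurl(url):
        page = urllib2.urlopen(url)
        soup = BeautifulSoup(page)
        print soup

    def main():
        urls = ["http://google.com", "http://yahoo.com"]
        pool = ThreadPool(5)       # never more than 5 worker threads
        pool.map(loadurl, urls)    # blocks until every URL has been processed
        pool.close()
        pool.join()
        print "[+] Finished!"

    if __name__ == "__main__":
        main()

`ThreadPool` lives in `multiprocessing.pool` and mirrors the process-based `Pool` API, so `map` handles both distributing the work across the 5 threads and waiting for the results.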
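And a minimal sketch of Tim Peters's Python 2 suggestion: a fixed number of daemon threads each run an infinite loop, pulling URLs off a shared `Queue.Queue`. The `worker` name is a hypothetical choice for illustration, and the thread count of 5 again comes from the question:

    import threading
    import urllib2
    import Queue

    from bs4 import BeautifulSoup

    def worker(queue):
        # Infinite loop: each worker pulls one URL at a time off the queue.
        while True:
            url = queue.get()
            try:
                page = urllib2.urlopen(url)
                soup = BeautifulSoup(page)
                print soup
            finally:
                queue.task_done()      # mark this URL as processed

    def main():
        queue = Queue.Queue()
        for url in ["http://google.com", "http://yahoo.com"]:
            queue.put(url)

        # Start exactly 5 workers, no matter how many URLs are queued.
        for _ in range(5):
            thread = threading.Thread(target=worker, args=(queue,))
            thread.daemon = True       # workers exit when the main thread does
            thread.start()

        queue.join()                   # block until every URL has been processed
        print "[+] Finished!"

    if __name__ == "__main__":
        main()

In Python 3 the same fixed-size pool is a one-liner with the executor Tim Peters mentions: `with concurrent.futures.ThreadPoolExecutor(max_workers=5) as ex: ex.map(loadurl, urls)` (and `Queue.Queue` becomes `queue.Queue`).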

0 Answers