
I'm writing a Python script that scrapes some pages from my web server and writes them to a file. I'm using the mechanize.Browser class for this particular task.

However, I've found that fetching pages through a single mechanize.Browser() instance is rather slow. Is there a way I could, relatively painlessly, use multithreading/multiprocessing (i.e. issue several GET requests at once)?
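To illustrate the shape of what I'm after, here is a sketch using the stdlib thread pool; `fetch` is a stand-in for the real per-page request code, not part of my actual script:

```python
# Sketch: issue several GETs at once via a thread pool.
# `fetch` is a placeholder for the real per-page request code
# (e.g. a function that drives its own mechanize.Browser,
# since Browser instances shouldn't be shared across threads).
from concurrent.futures import ThreadPoolExecutor

def fetch_all(urls, fetch, workers=8):
    """Call fetch(url) for each URL, up to `workers` at a time.

    Threads suit this workload: each one spends most of its time
    blocked on network IO, so the GIL is not a bottleneck.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input URLs
        return list(pool.map(fetch, urls))
```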

asked by Bo Milanovich · edited by cwallenpoole
  • Have you looked at the Python [threading](http://docs.python.org/library/threading.html) module? – ObscureRobot Oct 20 '11 at 05:10
  • Isn't threading module only for starting a new CPU thread? – Bo Milanovich Oct 20 '11 at 05:12
  • Related: http://stackoverflow.com/questions/4119680/multiple-asynchronous-connections-with-urllib2-or-other-http-library and http://stackoverflow.com/questions/4139988/multiple-urllib2-connections and http://stackoverflow.com/questions/6905800/multiprocessing-useless-with-urllib2 – amit kumar Oct 20 '11 at 05:17
  • well, if you don't want to use threading as @ObscureRobot suggested, you can try [multiprocessing](http://docs.python.org/library/multiprocessing.html). – imm Oct 20 '11 at 05:30
  • ObscureRobot and imm: I don't want CPU threads. As my post says, I want "[to] issue several GET requests at once" - as in HTTP GET request. @phaedrus - thanks, those are an interesting read. Doesn't seem to be very easy to implement, looks like I'd have to rewrite the entire app (over 3000 lines of code) – Bo Milanovich Oct 20 '11 at 05:53
  • @deusdies, we can't know how to help you unless you give us enough context to isolate what is so hard about using multiprocessing with your code. Sample code illustrating the problem would make this an answerable question – Mike Pennington Oct 20 '11 at 10:05
  • related: [Problem with multi threaded Python app and socket connections](http://stackoverflow.com/questions/4783735/problem-with-multi-threaded-python-app-and-socket-connections) – jfs Oct 23 '11 at 18:08
  • have you tried [scrapy](http://doc.scrapy.org/en/latest/intro/overview.html) – jfs Oct 23 '11 at 18:12

2 Answers


Use gevent or eventlet to get concurrent network IO.
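A minimal sketch of the gevent approach (eventlet is similar): monkey-patch the stdlib so blocking socket calls yield to other greenlets, then spawn one greenlet per URL. `fetch` here is a placeholder for the real request code, e.g. a function that opens the page with urllib or mechanize.

```python
# Sketch: concurrent GETs with gevent (assumes gevent is installed).
from gevent import monkey
monkey.patch_all()  # make blocking stdlib IO cooperative

import gevent

def fetch_all(urls, fetch):
    """Run fetch(url) for every URL in parallel greenlets.

    `fetch` is a stand-in for your real request function; each
    greenlet yields to the others whenever it blocks on the network.
    """
    jobs = [gevent.spawn(fetch, url) for url in urls]
    gevent.joinall(jobs)
    return [job.value for job in jobs]
```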

— cerberos

If you want industrial-strength Python web scraping, check out scrapy. It uses Twisted for asynchronous networking and is blindingly fast. Spidering through 50 pages per second isn't an unrealistic expectation.

— synthesizerpatel