Because of the GIL, multithreading in Python will only make the script faster at the points where it is blocking on I/O; CPU-intensive applications are unlikely to see any performance increase (if anything, they might get slower).
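For illustration, here's a minimal sketch of the I/O-bound case where threads do pay off; the URLs and worker count are placeholders:

```python
# Threads overlap nicely here because the GIL is released while each
# thread blocks on the network. URLs and worker count are illustrative.
from concurrent.futures import ThreadPoolExecutor

import requests

urls = ["https://example.com/page/%d" % i for i in range(20)]  # placeholder

def fetch(url):
    # Blocks on network I/O; other threads run during the wait.
    resp = requests.get(url, timeout=30)
    return url, len(resp.content)

with ThreadPoolExecutor(max_workers=8) as pool:
    for url, size in pool.map(fetch, urls):
        print(size, url)
```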
I have written scrapers for a multitude of different sites (some as large as 8+ TB in data). Python will struggle to reach full line rate in a single script; your best bet is to use a proper job queue (such as celery), then run multiple workers to achieve concurrency.
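A minimal sketch of that setup, assuming a Redis broker on localhost (the broker URL, retry policy, and task name are all illustrative):

```python
# tasks.py - minimal celery sketch; assumes a Redis broker on localhost.
import requests
from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0")

@app.task(bind=True, max_retries=3)
def download(self, url):
    try:
        resp = requests.get(url, timeout=60)
        resp.raise_for_status()
    except requests.RequestException as exc:
        # Let the broker re-deliver the job instead of losing it.
        raise self.retry(exc=exc, countdown=10)
    return len(resp.content)

# Enqueue from any producer:    download.delay("https://example.com/page")
# Run workers for concurrency:  celery -A tasks worker --concurrency=8
```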
If you don't want celery, another hacky approach is to use subprocess to call multiple instances of curl/wget/axel, blocking until each returns and then checking the exit code, that the file exists, and so on. However, if your script doesn't exit cleanly you will end up with orphaned processes, i.e. downloads that keep running even after you kill the script.
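A rough sketch of that approach, with wget standing in for any of them (the URL and path are placeholders):

```python
# Shell out to wget, block until it returns, then verify the exit code
# and the file on disk. URL and destination path are illustrative.
import os
import subprocess

def download(url, dest):
    result = subprocess.run(
        ["wget", "-q", "-O", dest, url],
        timeout=300,
    )
    # Inspect the exit code ourselves rather than passing check=True.
    if result.returncode != 0:
        raise RuntimeError(f"wget exited with {result.returncode} for {url}")
    if not os.path.exists(dest) or os.path.getsize(dest) == 0:
        raise RuntimeError(f"missing or empty file for {url}")

download("https://example.com/big.bin", "/tmp/big.bin")
```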
If you don't like the idea of subprocess, you can use something like eventlet or gevent instead, but you still won't achieve full line rate in a single script; you'll have to run multiple workers.
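A minimal gevent sketch for comparison (the monkey-patching has to happen before the network imports; URLs and pool size are illustrative):

```python
# Monkey-patch first so the socket layer becomes cooperative.
from gevent import monkey
monkey.patch_all()

import requests
from gevent.pool import Pool

urls = ["https://example.com/%d" % i for i in range(100)]  # placeholder

def fetch(url):
    return url, requests.get(url, timeout=30).status_code

pool = Pool(50)  # cap the number of concurrent greenlets
for url, status in pool.imap_unordered(fetch, urls):
    print(status, url)
```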
Some sites have rate limiting, so a job queue is usually a great way of getting around this too (e.g. lots of EC2 instances with random IPs), with X workers on each to get maximum throughput.
Python is a perfectly fine tool for scraping huge amounts of data, you just have to do it correctly.
Also, pyquery is significantly faster than BeautifulSoup in many cases for processing results. At the very least, don't rely on BeautifulSoup to request the data for you: use something like python-requests to fetch the result, then pass it into your parser (e.g. BeautifulSoup or pyquery).
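For example, something along these lines (the URL and CSS selector are made up):

```python
# Fetch with requests, then hand the HTML to the parser of your choice.
import requests
from pyquery import PyQuery

resp = requests.get("https://example.com/listing", timeout=30)  # placeholder URL
resp.raise_for_status()

doc = PyQuery(resp.text)
for a in doc("a.item-link").items():  # hypothetical selector
    print(a.attr("href"), a.text())
```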
There are also scalability considerations if you plan on scraping/storing large amounts of data, such as bandwidth optimizations when processing jobs and downloading content. Some storage clusters let you send a URL to their API and handle downloading the content for you, which saves you wasting bandwidth by downloading a file only to upload it into your backend - this can cut your bandwidth bill in half.
It's also worth mentioning that threading+BeautifulSoup has been discussed here already:
Urllib2 & BeautifulSoup : Nice couple but too slow - urllib3 & threads?