I want to scrape a site. There are about 8000 items to scrape, and if each request takes about 1 second, the whole run takes about 8000 seconds, which is roughly 133 minutes or about 2.2 hours. Can anyone help me speed this up by making multiple requests at the same time? I am using Python urllib2 to request the contents.
- If you do that, you'll likely get banned from the site you're trying to scrape. Did you read their Terms of Use? Is it OK with them if you scrape their site? – Robert Harvey Feb 18 '14 at 17:28
- Yes, they allow scraping. I just need an answer for my scenario. – user3324557 Feb 18 '14 at 17:38
- Look into Python scraping tools like Beautiful Soup or Scrapy. I know Scrapy can create multiple spiders and launch them to scrape URLs at the same time (12 spiders at once by default). – Ryan G Feb 18 '14 at 17:52
2 Answers
Use a better HTTP client. urllib2 makes requests with `Connection: close`, so a new TCP connection has to be negotiated every time. With `requests` you can reuse TCP connections:
s = requests.Session()
r = s.get("http://example.org")
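For your ~8000 items, that means creating one session up front and reusing it for every request. A minimal sketch, assuming item_urls is a hypothetical list holding your item URLs:
import requests

session = requests.Session()   # one session, so keep-alive connections are reused
for url in item_urls:          # item_urls: hypothetical list of the ~8000 item URLs
    response = session.get(url, timeout=10)
    # parse response.content here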
Make requests in parallel. Since this is I/O-bound, the GIL is not a problem and you can use threads. You can run a few simple threads that each download a batch of URLs and then wait for all of them to finish, but something like a "parallel map" may fit this better - I found this answer with a simple example:
https://stackoverflow.com/a/3332884/196206
If you share anything between threads, make sure it is thread-safe - the requests session object appears to be thread-safe: https://stackoverflow.com/a/20457621/196206
Update - a small example:
#!/usr/bin/env python
import lxml.html
import requests
import multiprocessing.dummy
import threading
first_url = "http://api.stackexchange.com/2.2/questions?pagesize=10&order=desc&sort=activity&site=stackoverflow"
rs = requests.session()
r = rs.get(first_url)
links = [item["link"] for item in r.json()["items"]]
lock = threading.Lock()
def f(data):
    n, link = data
    r = rs.get(link)
    doc = lxml.html.document_fromstring(r.content)
    names = [el.text for el in doc.xpath("//div[@class='user-details']/a")]
    with lock:
        print("%s. %s" % (n+1, link))
        print(", ".join(names))
        print("---")
    # you can also return values; they will be returned
    # from pool.map() in the order corresponding to the links
    return (link, names)

pool = multiprocessing.dummy.Pool(5)
names_list = pool.map(f, enumerate(links))
print(names_list)
- Thanks for the quick response. If I get 6 responses at the same time, how do I use lxml to extract the data from them at the same time (instead of one by one as in the current scenario) and write the results to a file concurrently? – user3324557 Feb 18 '14 at 17:46
- Do the `lxml` processing together with the downloading in that function called by the parallel `map`. You can also write to a file there, but use a lock (http://docs.python.org/2/library/threading.html#lock-objects) to prevent parallel file writes from interleaving. – Messa Feb 18 '14 at 17:50
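A minimal sketch of what that comment describes, reusing rs, lock, links and the imports from the example above; the output filename and the xpath are hypothetical placeholders:
out_file = open("names.txt", "w")  # hypothetical output file

def fetch_parse_and_save(link):
    r = rs.get(link)                                # download (session shared across threads)
    doc = lxml.html.document_fromstring(r.content)  # lxml parsing happens in the same worker
    names = [el.text for el in doc.xpath("//div[@class='user-details']/a")]
    with lock:                                      # serialize writes so lines do not interleave
        out_file.write("%s\t%s\n" % (link, ", ".join(names)))

pool = multiprocessing.dummy.Pool(5)
pool.map(fetch_parse_and_save, links)
out_file.close()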
You should consider using Scrapy instead of working directly with lxml and urllib. Scrapy is "a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages." It is built on top of Twisted, so it is asynchronous by design and, as a result, very fast.
I can't give you any specific numbers on how much faster your scraping will be, but imagine your requests happening in parallel instead of serially. You will still need to write the code to extract the information you want, using XPath or Beautiful Soup, but you won't have to implement the page fetching yourself.
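To give a feel for what that looks like, here is a minimal sketch of a Scrapy spider; the start URLs and the selector are hypothetical placeholders you would replace with your site's real URLs and page structure:
import scrapy

class ItemSpider(scrapy.Spider):
    name = "items"
    # hypothetical item URLs - Scrapy schedules and fetches these concurrently
    start_urls = ["http://example.org/item/%d" % i for i in range(1, 8001)]

    def parse(self, response):
        # hypothetical selector - adjust to the real page structure
        yield {
            "url": response.url,
            "title": response.xpath("//h1/text()").extract_first(),
        }
You can run it with something like scrapy runspider item_spider.py -o items.json; the number of parallel requests is governed by Scrapy settings such as CONCURRENT_REQUESTS (16 by default).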

- Though parallel requests are obviously faster, do keep in mind that different scraping targets will have different reactions to aggressive scraping. A scraping target can make your life quite difficult if they care to (if you don't cover your tracks, possibly even legally), so be sure to respect their wishes to the extent possible. – Andrew Gorcester Feb 18 '14 at 19:29