I am studying web scraping for big data, and I wrote the code below to pull some information from a local server on our campus. It works fine, but the performance is very slow: each record takes 0.91s to get stored in the database. The code opens a web page, extracts a few fields, and saves them to the local database.
My goal is to get the time per record down to around 0.4s, or less if possible.
#!/usr/bin/env python
import scraperwiki
import requests
import lxml.html

for i in range(1, 150):
    try:
        # Fetch the page and parse it
        html = requests.get("http://testserver.dc/" + str(i) + "/").content
        dom = lxml.html.fromstring(html)

        # Extract the three fields
        for entry in dom.cssselect('.rTopHeader'):
            name = entry.cssselect('.bold')[0].text_content()
        for entry in dom.cssselect('div#rProfile'):
            city = entry.cssselect('li:nth-child(2) span')[0].text_content()
        for entry in dom.cssselect('div#rProfile'):
            profile_id = entry.cssselect('li:nth-child(3) strong a')[0].get('href')

        # Save one record per page, keyed on profile_id
        profile = {
            'name': name,
            'city': city,
            'profile_id': profile_id
        }
        unique_keys = ['profile_id']
        scraperwiki.sql.save(unique_keys, profile)
        print profile_id
    except:
        print 'Error: ' + str(i)
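
Two changes I am considering are reusing a single HTTP connection via requests.Session and buffering the rows so the database is written once at the end rather than 149 times. Below is a minimal sketch of both ideas; it assumes the campus server supports keep-alive and that scraperwiki.sql.save accepts a list of dicts, neither of which I have verified yet.

#!/usr/bin/env python
import scraperwiki
import requests
import lxml.html

session = requests.Session()  # reuse one TCP connection (assumes the server supports keep-alive)
profiles = []                 # buffer rows in memory, write them once at the end

for i in range(1, 150):
    try:
        html = session.get("http://testserver.dc/" + str(i) + "/").content
        dom = lxml.html.fromstring(html)
        entry = dom.cssselect('div#rProfile')[0]
        profiles.append({
            'name': dom.cssselect('.rTopHeader .bold')[0].text_content(),
            'city': entry.cssselect('li:nth-child(2) span')[0].text_content(),
            'profile_id': entry.cssselect('li:nth-child(3) strong a')[0].get('href'),
        })
    except Exception:
        print 'Error: ' + str(i)

# One bulk save instead of 149 single-row saves
# (assumes scraperwiki.sql.save accepts a list of dicts)
scraperwiki.sql.save(['profile_id'], profiles)

My guess is that most of the 0.91s goes to connection setup and per-row database writes rather than parsing, but I have not profiled it, so I would welcome corrections on where the time actually goes.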