
I am studying web scraping for big data, so I wrote the following code to take some information from a local server on our campus. It works fine but I think the performance is very slow; each record takes 0.91s to get stored in the database. What the code does is open a web page, take some content and store it on disk.

My goal is to lower the time elapsed for scraping a record to something near 0.4s (or less, if possible).

#!/usr/bin/env python

import scraperwiki
import requests
import lxml.html
for i in range(1, 150):
    try:
        html = requests.get("http://testserver.dc/" + str(i) + "/").content
        dom = lxml.html.fromstring(html)
        for entry in dom.cssselect('.rTopHeader'):
            name = entry.cssselect('.bold')[0].text_content()

        for entry in dom.cssselect('div#rProfile'):
            city = entry.cssselect('li:nth-child(2) span')[0].text_content()

        for entry in dom.cssselect('div#rProfile'):
            profile_id = entry.cssselect('li:nth-child(3) strong a')[0].get('href')
        profile = {
                'name':name,
                'city':city,
                'profile_id':profile_id
            }
        unique_keys = [ 'profile_id' ]
        scraperwiki.sql.save(unique_keys, profile)
        print profile_id
    except:
        print 'Error: ' + str(i)
Mohammad Abu Musa
  • You have 3 places where you consume time: the HTTP get, the DOM manipulation, and the database save. What you should do is first run the loop with only the get, next with the get and DOM manipulation but without the database save, and finally with the full treatment. That will show you where you can gain time (see the timing sketch after these comments). – Serge Ballesta May 11 '14 at 09:55
  • At some point, there might not be much you can do about the speed of scraping per record because this is dependent on the latency of your network. You might instead consider doing the scraping in parallel through multithreading. – joemar.ct May 11 '14 at 10:00
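
A minimal sketch of the profiling the first comment describes, timing each phase (HTTP get, DOM parsing, database save) separately; the URL pattern and selectors come from the question, and the saved record is simplified for illustration only:

# Sketch only: time the three phases separately to see where the
# 0.91 s/record actually goes. URL and selectors are from the question;
# the record passed to save() is simplified for timing purposes.
import time
import requests
import lxml.html
import scraperwiki

t_get = t_parse = t_save = 0.0
for i in range(1, 21):  # a small sample is enough for profiling
    start = time.time()
    html = requests.get("http://testserver.dc/" + str(i) + "/").content
    t_get += time.time() - start

    start = time.time()
    dom = lxml.html.fromstring(html)
    name = dom.cssselect('.rTopHeader .bold')[0].text_content()
    t_parse += time.time() - start

    start = time.time()
    scraperwiki.sql.save(['profile_id'], {'profile_id': i, 'name': name})
    t_save += time.time() - start

print('get: %.2fs  parse: %.2fs  save: %.2fs' % (t_get, t_parse, t_save))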

1 Answer


It is very good that you have a clear aim for how far you want to optimize.

Measure time needed for scraping first

It is likely that the limiting factor is scraping the URLs.

Simplify your code and measure how long it takes to scrape. If this does not meet your timing criteria (e.g. if one request takes 0.5 seconds), you will have to do the scraping in parallel. Search Stack Overflow; there are many such questions and answers using threading, green threads, etc.
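
For example, a minimal sketch of fetching in parallel with a thread pool (Python 3's concurrent.futures, or the futures backport on Python 2); the URL pattern and id range are taken from the question, and max_workers=10 is an arbitrary value to tune:

# Sketch only: fetch all pages through a thread pool so the network
# latency of the requests overlaps. max_workers=10 is a guess to tune;
# parsing and saving can then be done on the downloaded pages.
import time
from concurrent.futures import ThreadPoolExecutor
import requests

urls = ["http://testserver.dc/%d/" % i for i in range(1, 150)]

def fetch(url):
    return requests.get(url).content

start = time.time()
with ThreadPoolExecutor(max_workers=10) as pool:
    pages = list(pool.map(fetch, urls))
print("%.2f s/page" % ((time.time() - start) / len(urls)))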

Further optimization tips

Parsing XML - use iterative / SAX approach

Your DOM creation can be turned into iterative parsing. It needs less memory and is very often much faster. lxml offers methods like `iterparse`. Also see this related SO answer.
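
For illustration, a minimal sketch of iterative parsing with lxml.etree.iterparse; the 'bold' class comes from the question's selectors, but extracting by class attribute and the exact markup are assumptions:

# Sketch only: parse iteratively with lxml.etree.iterparse instead of
# building a full DOM and running CSS selectors. Each element is handled
# and cleared as soon as its closing tag is seen, which keeps memory low.
from io import BytesIO
import lxml.etree

def parse_name(html_bytes):
    name = None
    for _, element in lxml.etree.iterparse(BytesIO(html_bytes),
                                           events=('end',), html=True):
        classes = (element.get('class') or '').split()
        if name is None and 'bold' in classes:
            name = (element.text or '').strip()
        element.clear()  # release the element's text and children
    return name

On small profile pages like these the gain is mostly in memory; the parallel fetch and the batched save below are likely to matter more for the 0.4 s target.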

Optimize writing into the database

Writing many records one by one can be turned into writing them in batches.
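
For example, a hedged sketch of batching: collect the scraped dicts in a list and pass the whole list to scraperwiki.sql.save, which accepts a list of dicts as well as a single dict (see the comment below); scrape_profile is a hypothetical helper wrapping the fetch-and-parse steps above:

# Sketch only: buffer the records and write them to the database in one
# call instead of once per record. scrape_profile is a hypothetical
# helper that wraps the fetch + parse steps.
import scraperwiki

profiles = []
for i in range(1, 150):
    profile = scrape_profile(i)
    if profile is not None:
        profiles.append(profile)

scraperwiki.sql.save(['profile_id'], profiles)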

Jan Vlcinsky
  • On the database optimization, `scraperwiki.sql.save` is slow when saving records individually. You can [pass it a list of dicts instead](https://github.com/scraperwiki/scraperwiki-python#saving-data). As Jan says, scraping time is likely to be the rate determining step though. – Steven Maude May 11 '14 at 12:38