Pages can change so drastically that building a very "smart" scraper is difficult; and even if it were possible, the scraper would be somewhat unpredictable, even with fancy techniques like machine learning. It's hard to make a scraper that is both trustworthy and automatically flexible.
Maintainability is something of an art form centered around how selectors are defined and used.
In the past I have rolled my own "two stage" selectors:
(find) The first stage is highly inflexible and checks the structure of the page leading to the desired element. If the first stage fails, it throws some kind of "page structure changed" error.
(retrieve) The second stage is somewhat flexible and extracts the data from the desired element on the page.
This insulates the scraper from drastic page changes with some level of automatic detection, while still maintaining a degree of trustworthy flexibility.
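For illustration, a minimal Python sketch of that two-stage pattern might look like the following (the URL, element names, XPath expressions, and error type here are all placeholders, not taken from any real page):

```python
import requests
from lxml import html


class PageStructureChangedError(Exception):
    """Raised when the strict first-stage (find) selector stops matching."""


def find_article_node(tree):
    # Stage 1 (find): deliberately inflexible. It asserts the page skeleton
    # we depend on; if the structure changes, fail loudly instead of
    # silently scraping the wrong thing.
    nodes = tree.xpath('//main[@id="articles"]/article[@class="post"]')
    if not nodes:
        raise PageStructureChangedError("main#articles > article.post not found")
    return nodes[0]


def retrieve_title(node):
    # Stage 2 (retrieve): flexible, scoped to the node found in stage 1.
    # A descendant search tolerates extra wrappers being added later.
    titles = node.xpath('.//*[contains(@class, "title")]/text()')
    return titles[0].strip() if titles else None


def scrape(url):
    tree = html.fromstring(requests.get(url, timeout=10).text)
    return retrieve_title(find_article_node(tree))
```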
I have frequently used XPath selectors, and with a little practice it is quite surprising how flexible you can be with a good selector while still being very accurate. I'm sure CSS selectors are similar. This gets easier the more semantic and "flat" the page design is.
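As a rough illustration of that flexibility/accuracy trade-off, here is a brittle XPath next to a more resilient one, run against a small made-up snippet of markup:

```python
from lxml import html

doc = html.fromstring("""
<html><body>
  <div class="content">
    <div class="deal featured">
      <span class="tag"><span class="price">$9.99</span></span>
    </div>
  </div>
</body></html>
""")

# Brittle: tied to the exact nesting depth, so any new wrapper element breaks it.
brittle = doc.xpath('/html/body/div/div/span/span/text()')

# Flexible but still accurate: anchored on meaningful class names rather than
# on position, so extra wrappers or reordered siblings don't matter.
flexible = doc.xpath('//div[contains(@class, "deal")]//span[contains(@class, "price")]/text()')

print(brittle, flexible)  # both yield ['$9.99'] on this markup
```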
A few important questions to answer are:
What do you expect to change on the page?
What do you expect to stay the same on the page?
The more accurately you can answer these questions, the better your selectors can become.
In the end, it's your choice how much risk you want to take and how trustworthy your selectors will be, both when finding and when retrieving data on a page; how you craft them makes a big difference. Ideally, it's best to get data from a web API, which hopefully more sources will begin providing.
EDIT: Small example
Using your scenario, where the element you want is at `.content > .deal > .tag > .price`, the general `.content .price` selector is very "flexible" regarding page changes; but if, say, a false-positive element arises, we may want to avoid extracting from this new element.
Using two-stage selectors we can specify a less general, more inflexible first stage like `.content > .deal`, and then a second, more general stage like `.price` to retrieve the final element using a query relative to the results of the first.
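In Python with BeautifulSoup, that two-stage version might look roughly like this (the function name and error handling are just one possible shape, not a prescribed implementation):

```python
from bs4 import BeautifulSoup


def extract_price(page_html):
    soup = BeautifulSoup(page_html, "html.parser")

    # Stage 1 (find): inflexible, structure-checking selector.
    deal = soup.select_one(".content > .deal")
    if deal is None:
        raise RuntimeError("page structure changed: .content > .deal not found")

    # Stage 2 (retrieve): general selector, scoped to the stage-1 element.
    price = deal.select_one(".price")
    return price.get_text(strip=True) if price else None
```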
So why not just use a selector like `.content > .deal .price`?
For my use, I wanted to be able to detect large page changes without running extra regression tests separately. I realized that rather than one big selector, I could write the first stage to include important page-structure elements. This first stage would fail (or report) if the structural elements no longer exist. Then I could write a second stage to more gracefully retrieve data relative to the results of the first stage.
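A sketch of how that plays out in practice, reusing the hypothetical `extract_price()` from above (the URL and the plain-logging "report" are placeholders for whatever alerting you prefer):

```python
import logging

import requests


def run_scraper():
    page = requests.get("https://example.com/deals", timeout=10).text
    try:
        return extract_price(page)
    except RuntimeError as err:
        # The failed first stage acts as a built-in regression check:
        # a structural page change surfaces here as a clear error
        # instead of as silently wrong data downstream.
        logging.error("scraper needs attention: %s", err)
        return None
```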
I shouldn't say that it's a "best" practice, but it has worked well.