
Possible Duplicate:
How can I screen scrape with Perl?
Web scraping with Python

This isn't my field of work, so pardon the general lack of knowledge. I'm looking for a well-documented Python or Perl library for site scraping: pulling some product information from tables on various pages of a site into a more user-friendly format such as Excel, for which both languages have satisfactory options.

Can anybody give a recommendation or a starting point on the subject? Googling turns up several interesting matches, but being short on time I'd rather not go hunting down the wrong track; I'd sooner trust someone with experience in the matter.

Rook
  • possible duplicate of [How can I screen scrape with Perl?](http://stackoverflow.com/questions/713827/how-can-i-screen-scrape-with-perl) and [What is the best way to programmatically log into a web site in order to screen scrape?](http://stackoverflow.com/questions/832673/what-is-the-best-way-to-programmatically-log-into-a-web-site-in-order-to-screen?rq=1) – Thilo Aug 02 '12 at 00:01
  • @Thilo - Yes, there were several questions on this topic. However, most of them are just lists of answers, which really don't get me anywhere closer. A lot of them don't even have examples of usage in their documentation. That's why I specified that, since I'm on a quick course here. – Rook Aug 02 '12 at 00:06

3 Answers


In Python there is a library called Scrapy, as well as more basic approaches such as using mechanize, or another HTTP interface paired with a parser such as lxml or BeautifulSoup.
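
For orientation, a minimal Scrapy spider might look like this rough sketch; the URL and the CSS selectors are placeholders rather than anything from the thread:

import scrapy

class ProductSpider(scrapy.Spider):
    name = 'products'
    # Placeholder start page; point this at the real product listing
    start_urls = ['http://example.com/products']

    def parse(self, response):
        # Yield the text of every table cell, one record per row
        for row in response.css('table tr'):
            yield {'cells': row.css('td::text').getall()}

Running it with `scrapy runspider spider.py -o products.csv` gives output that opens directly in Excel.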

In the comments it was mentioned that these don't have tutorials, but using mechanize is relatively simple (via its Browser object), while lxml provides an easy way to jump around the DOM using XPath.
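
As a rough, untested sketch of that combination (the URL and the XPath are placeholders, not from the answer):

import mechanize
from lxml import html

# mechanize's Browser object handles cookies, redirects, and forms
br = mechanize.Browser()
br.set_handle_robots(False)
response = br.open('http://example.com')  # placeholder URL

# lxml lets you jump around the DOM with XPath
tree = html.fromstring(response.read())
for cell in tree.xpath('//table//td/text()'):
    print(cell)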

Although I have never used it, Selenium also seems like a good option, albeit much more complicated.
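
By way of illustration only (the answer itself gives no code for this), driving a real browser with Selenium might look roughly like:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()  # requires a browser driver on the PATH
driver.get('http://example.com')  # placeholder URL
for cell in driver.find_elements(By.XPATH, '//table//td'):
    print(cell.text)  # rendered text, including JavaScript-generated content
driver.quit()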

Snakes and Coffee
  • I would second beautifulsoup for parsing the results. For dealing with real-life web pages, it's leaps and bounds better than any other parser I've tried to work with. – Mark Tozzi Aug 02 '12 at 00:10
  • Thanks Just. I'll try Scrapy and soup and see what I can come up with. If you also, by any chance, know of any tutorials on this topic in general that would be understandable to someone who hasn't got a clue (mostly NumPy programming up to now :/), don't be shy to put them up :) – Rook Aug 02 '12 at 00:12
  • Try this - http://www.pixelmender.com/2010/10/12/scraping-data-using-scrapy-framework/ – Snakes and Coffee Aug 02 '12 at 00:13

I needed to hunt down all instances of a pesky HTML class a few days ago, and threw together the following in next to no time - it's both a scraper and a crawler, and it's tiny.

import sys
import urllib.parse as uparse
import httplib2
from bs4 import BeautifulSoup

http = httplib2.Http()

# URLs already visited, so each page is fetched at most once
hit_urls = set()

def crawl(url, check, do, depth=1):
    """Fetch url, call do(url, soup), then follow links that pass check."""
    if url in hit_urls:
        #print("**Skipping %s" % url)
        return
    #print("Crawling %s" % url, file=sys.stderr)
    hit_urls.add(url)

    # httplib2 returns (headers, body); only the body is needed here
    _, content = http.request(url)
    soup = BeautifulSoup(content, 'html.parser')

    resp = do(url, soup)

    if depth > 0:
        for link in soup.find_all('a'):
            if link.has_attr('href'):  # bs4 dropped Tag.has_key()
                rel_url = link['href']
                if check(rel_url):
                    crawl(uparse.urljoin(url, rel_url), check, do, depth-1)

    return resp

def isLocal(url):
    # Follow only same-site (relative) links
    if not url.startswith('/'):
        return False
    if url.startswith('/goToUrl'): # 3rd party redirect page
        return False
    return True

def findBadClass(url, soup):
    # Print every tag on the page that carries the unwanted class
    for t in soup.find_all(True, {'class': 'badClass'}):
        print(url + ":" + str(t))

if __name__ == '__main__':
    crawl('http://example.com', isLocal, findBadClass)
dimo414
  • I don't have BeautifulSoup on my machine as of yet, but I'll definitely play with it tomorrow. A good starting point! Thanks dimo! – Rook Aug 02 '12 at 00:30
  • If you download it and run `python setup.py build` that will create a `lib/something/` directory containing everything you need to run Beautiful Soup, and you can drop that /something/ directory into your code as a module to import. You can also run `python setup.py install` which will drop it into your python install for you automatically, but I personally prefer to manually add libraries. – dimo414 Aug 02 '12 at 00:41
  • Yup, same here. It's not a problem of installing the library, but more that it's now 2:49 am in here now. I didn't think people would respond that quickly when I posted the question :-) – Rook Aug 02 '12 at 00:50

If you just want to scrape a handful of sites with consistent formatting, the easiest thing would probably be to use requests combined with regular expressions and Python's built-in string processing.

import re
import requests

# Fetch the listing index page
resp = requests.get('http://austin.craigslist.org/cto/')

# Match a link to an individual posting, capturing its URL and link text
regex = (r'<a href="(http://austin\.craigslist\.org/cto/[0-9]+\.html)">'
         r'([a-zA-Z0-9 ]+)</a>')

for i, match in enumerate(re.finditer(regex, resp.text)):
    if i > 5:
        break
    url = match.group(1)
    print('url:', url)
    resp = requests.get(url)
    title = re.search('<h2>(.+)</h2>', resp.text).group(1)
    print('title:', title)
    # The posting body sits between the userbody div and the trailing script/CLTAGS blocks
    body = resp.text.split('<div id="userbody">', 1)[1]
    body = body.split('<script type="text/javascript">')[0]
    body = body.split('<!-- START CLTAGS -->')[0]
    print('body:', body)
    print()

Edit: To clarify, I've used Beautiful Soup and think it's overrated. I found it weird and wonky and hard to use in real-world circumstances. Also, it's too much work to learn a new library for a one-off scraper; you're better off using standard techniques (i.e. the ones I suggested above) that can be applied elsewhere when doing Python scripting.

Jesse Aldridge
  • Please don't try to parse HTML with regular expressions. – Abe Karplus Aug 02 '12 at 00:49
  • This is why I included the initial caveat. There's no reason not to use regular expressions in certain cases. – Jesse Aldridge Aug 02 '12 at 01:09
  • I agree about Beautiful Soup being wonky and being better off with standard techniques; however, in this case the standard technique is XPath, not regex, and therefore lxml is preferred. – pguardiario Aug 02 '12 at 12:30
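
As pguardiario suggests, the same listing page can be handled with lxml and XPath instead of regular expressions; a rough sketch under that assumption (the XPath is illustrative, not from the thread):

import requests
from lxml import html

resp = requests.get('http://austin.craigslist.org/cto/')
tree = html.fromstring(resp.text)

# Select posting links by URL pattern rather than matching raw HTML with a regex
for link in tree.xpath('//a[contains(@href, "/cto/")]')[:5]:
    print(link.get('href'), link.text_content())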