What pure Python library should I use to scrape a website?

Question

I currently have some Ruby code used to scrape some websites. I was using Ruby because at the time I was using Ruby on Rails for a site, and it just made sense.

Now I'm trying to port this over to Google App Engine, and I keep getting stuck.

I've ported Python Mechanize to work with Google App Engine, but it doesn't support DOM inspection with XPath.

I've tried the built-in ElementTree, but it choked on the first HTML blob I gave it when it ran into '&mdash'.
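The failure described above can be reproduced with a minimal sketch, assuming a modern Python 3 standard library rather than the App Engine runtime of the time. The sample markup and the helper names `parses_as_xml`, `TextCollector`, and `extract_text` are hypothetical. ElementTree's XML parser rejects `&mdash;` because it is an HTML entity that plain XML does not define, while the standard library's pure-Python `html.parser` decodes it.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical sample: valid HTML, but not well-formed XML because of &mdash;
SAMPLE = "<p>Fast &mdash; and pure Python</p>"


def parses_as_xml(markup):
    """Return True if ElementTree accepts the markup, False on a parse error."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        # Raised for entities such as &mdash; that XML does not predefine
        return False


class TextCollector(HTMLParser):
    """Collect text nodes; HTMLParser decodes HTML entities by default."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract_text(markup):
    """Return the concatenated text content of an HTML fragment."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return "".join(collector.chunks)


if __name__ == "__main__":
    print(parses_as_xml(SAMPLE))
    print(extract_text(SAMPLE))
```

This only shows why an XML parser is the wrong tool for real-world HTML; it does not provide XPath support.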

Do I keep trying to hack ElementTree in there, or do I try to use something else?

thanks, Mark

Duplicate of all of these: http://stackoverflow.com/search?q=%5Bpython%5D+html+parse – S.Lott Oct 13 '09 at 22:02
I might have to go with Scrapy. Can I use XPath with Beautiful Soup? – MStodd Oct 15 '09 at 05:53
Actually, I might have to go with neither, since I'm not sure Beautiful Soup works with XPath, and it looks like Scrapy has a binary dependency. – MStodd Oct 15 '09 at 06:00

score 11 · Answer 1 · answered Oct 13 '09 at 22:01

Beautiful Soup.

S.Lott

For some reason I was thinking that wasn't pure Python, but it looks like it is. I'll check it out. – MStodd Oct 13 '09 at 22:11
Second that. Beautiful Soup is incredible. – David Wolever Oct 13 '09 at 22:21
Right answer, but there's something fundamentally broken about getting 60+ points for being the first person to write two words. ;) – Nick Johnson Oct 14 '09 at 10:18
@Nick Johnson: Since it's a duplicate question, it's doubly wrong to get upvoted for answering it yet again. – S.Lott Oct 14 '09 at 11:47
It's not really a duplicate. I need a pure-Python solution that works with XPath. I don't know that any of the suggestions so far meet those requirements. – MStodd Oct 15 '09 at 06:02

score 6 · Answer 2 · answered Oct 13 '09 at 22:28

lxml: 100x better than ElementTree.

Billy Joe

lxml is a wrapper for a C library, so it cannot run on App Engine. – Roberto Bonvallet Oct 13 '09 at 22:44
It's also going to barf just as hard on badly formed HTML. – jcdyer Oct 13 '09 at 23:38
jcd - not true. lxml includes several options for parsing HTML, including using BeautifulSoup as a parser backend - http://codespeak.net/lxml/elementsoup.html – Matt Good Oct 14 '09 at 04:16
lxml is now supported on App Engine with Python 2.7: https://developers.google.com/appengine/docs/python/tools/libraries27 – igniteflow Oct 02 '13 at 09:07

score 4 · Answer 3 · answered Oct 13 '09 at 22:29

There's also Scrapy, which might be more up your alley.

Autoplectic

It does need lxml or libxml2, though. – sleeplessnerd Aug 16 '11 at 03:47

PaulMcG · Answer 4 · 2009-10-14T03:43:25.397

There are a number of examples of web page scrapers written using pyparsing, such as one that extracts all URL links from yahoo.com and one that extracts the NIST NTP server addresses. Be sure to use the pyparsing helper method makeHTMLTags instead of hand-coding "<" + Literal(tagname) + ">". makeHTMLTags creates a very robust parser that accommodates extra spaces, upper/lower case inconsistencies, unexpected attributes, attribute values with various quoting styles, and so on. Pyparsing also gives you more control over special syntax issues, such as custom entities. It is pure Python, liberally licensed, and has a small footprint (a single source module), so it is easy to drop into your GAE app alongside your other application code.
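As a rough illustration of the makeHTMLTags approach, here is a minimal sketch, assuming the pyparsing package is installed and run under Python 3. The sample markup, the `link_text` results name, and the `extract_links` helper are hypothetical and are not taken from the examples mentioned above. Newer pyparsing releases also offer the same helpers under the snake_case names `make_html_tags` and `scan_string`.

```python
from pyparsing import SkipTo, makeHTMLTags

# Hypothetical sample markup with mixed case, extra spaces, an unexpected
# attribute, and both quoting styles
SAMPLE_HTML = """
<A HREF="http://example.com/one">First</A>
<a  class='nav'   href='http://example.com/two' >Second</a>
"""

# makeHTMLTags returns a pair of expressions: one for the opening tag
# (including its attributes) and one for the closing tag
anchor_start, anchor_end = makeHTMLTags("a")

# An anchor element: opening tag, everything up to the closing tag, closing tag
link = anchor_start + SkipTo(anchor_end)("link_text") + anchor_end


def extract_links(html):
    """Return (href, text) pairs for each anchor element found in html."""
    return [
        (tokens.href, tokens.link_text.strip())
        for tokens, _start, _end in link.scanString(html)
    ]


if __name__ == "__main__":
    for href, text in extract_links(SAMPLE_HTML):
        print(href, text)
```

Attribute names are lowercased by the opening-tag expression, which is why `tokens.href` matches the upper-case `HREF` in the first anchor.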

score 0 · Answer 5 · answered Nov 25 '09 at 00:18

BeautifulSoup is good, but its API is awkward. Try ElementSoup, which provides an ElementTree interface to BeautifulSoup.

hoju
