
Possible Duplicate:
Parsing HTML in Python

I have a long string of HTML similar to the following:

<ul>
<li><a href="/a/long/link">Class1</a></li>
<li><a href="/another/link">Class2</a></li>
<li><img src="/image/location" border="0">Class3</a></li>
</ul>

It has several list entries (Class1 to Class8). I'd like to turn this into a list in Python with only the class names, as in

["Class1", "Class2", "Class3"]

and so on.

How would I go about doing this? I've tried using REs, but I haven't been able to find a method that works. Of course, with only 8 classes I could easily do it manually, but I have several more HTML documents to extract data from.

Thanks! :)

Community
  • Check out the documentation for http://docs.python.org/library/htmlparser.html – Alex Churchill Aug 09 '11 at 21:20
  • See http://stackoverflow.com/questions/3276040/how-can-i-use-the-python-htmlparser-library-to-extract-data-from-a-specific-div-t if you want an example of HTMLParser – Alex Churchill Aug 09 '11 at 21:23
  • Try BeautifulSoup: `soup = BeautifulSoup(html); soup.findAll("li", text=True)`; it'll return all the class names. – kenorb Jun 22 '14 at 14:19
  • See also [Only extracting text from this element, not its children](http://stackoverflow.com/questions/4995116/only-extracting-text-from-this-element-not-its-children). – kenorb Jun 22 '14 at 14:21
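For reference, the HTMLParser route suggested in the comments could look roughly like the sketch below. It assumes Python 3, where the module is named `html.parser`; the class name `ListItemText` and the variable names are made up for illustration and do not come from the linked posts.

```python
from html.parser import HTMLParser


class ListItemText(HTMLParser):
    """Collect the visible text of every <li> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.items = []      # finished <li> texts
        self._parts = None   # text chunks of the <li> being read, or None outside one

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._parts = []

    def handle_data(self, data):
        # Only keep text that appears between <li> and </li>.
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "li" and self._parts is not None:
            self.items.append("".join(self._parts).strip())
            self._parts = None


html = """<ul>
<li><a href="/a/long/link">Class1</a></li>
<li><a href="/another/link">Class2</a></li>
<li><img src="/image/location" border="0">Class3</a></li>
</ul>"""

parser = ListItemText()
parser.feed(html)
parser.close()
class_names = parser.items
print(class_names)
```

Because the parser only tracks `<li>` boundaries, the stray `</a>` in the third item is simply ignored.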

3 Answers


Check out lxml (pip install lxml). You'll want to do a little more research, but effectively it comes down to something like this:

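from lxml import etree

tree = etree.HTML(page_source)  # page_source: the HTML document as a string

def parse_list(xpath):
    # tree.xpath() returns a list of matching elements, so loop over each <ul>.
    # itertext() also picks up text nested inside <a> or following an <img>.
    return [''.join(li.itertext()).strip()
            for ul in tree.xpath(xpath)
            for li in ul.iterchildren('li')]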
Ceasar

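If every list item sits on its own line and ends the same way (with </a></li>), you could try a regular expression like the one below, applied line by line or with the re.MULTILINE flag: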

re.compile(r'^<li><.*>(.*)</a></li>$')

If you're expecting much more variability in the file than in your example, then something like an HTML parser would probably be better.
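As a rough illustration of how that pattern might be applied to the whole string at once, here is a small sketch. The `re.MULTILINE` flag and the variable names are assumptions added here, not part of the answer's one-liner.

```python
import re

html = """<ul>
<li><a href="/a/long/link">Class1</a></li>
<li><a href="/another/link">Class2</a></li>
<li><img src="/image/location" border="0">Class3</a></li>
</ul>"""

# re.MULTILINE lets ^ and $ match at the start and end of every line,
# so the pattern is tested against each <li> line separately.
li_pattern = re.compile(r'^<li><.*>(.*)</a></li>$', re.MULTILINE)

# With a single capture group, findall() returns the captured strings.
regex_names = li_pattern.findall(html)
print(regex_names)
```

This only holds up while each item stays on one line and ends in `</a></li>`; anything looser is a job for a parser.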

dpitch40

This should work, but treat it as a quick and ugly hack; in general, do not parse HTML with regular expressions.

>>> hdata = """<ul>
... <li><a href="/a/long/link">Class1</a></li>
... <li><a href="/another/link">Class2</a></li>
... <li><img src="/image/location" border="0">Class3</a></li>
... </ul>"""
>>> import re
>>> lire = re.compile(r'<li>.*?>(.*?)<.*')
>>> [lire.search(x).groups()[0] for x in hdata.splitlines() if lire.search(x)]
['Class1', 'Class2', 'Class3']

You could try ElementTree if your source is valid XML; otherwise, look at Beautiful Soup.
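A minimal sketch of the ElementTree route, assuming the markup is well-formed XML. The sample from the question is not (the `<img>` is never closed and there is a stray `</a>`), so the string below is a hand-corrected variant used purely for illustration.

```python
import xml.etree.ElementTree as ET

# Well-formed variant of the sample: <img> is self-closed and the stray </a> is removed.
xml_source = """<ul>
<li><a href="/a/long/link">Class1</a></li>
<li><a href="/another/link">Class2</a></li>
<li><img src="/image/location" border="0"/>Class3</li>
</ul>"""

root = ET.fromstring(xml_source)

# itertext() yields the text of nested elements and the text that follows them,
# so it covers both the <a> items and the <img> item.
et_names = ["".join(li.itertext()).strip() for li in root.iter("li")]
print(et_names)
```

On the original, malformed markup `ET.fromstring` would raise a `ParseError`, which is where a lenient parser such as Beautiful Soup comes in.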

Facundo Casco
  • Thanks! I actually did use Beautiful Soup to isolate the list from the rest of the HTML document, but wasn't sure how to go further than that. I'll have a look at it :) –  Aug 10 '11 at 15:49