How do I extract HTML list entries into a Python list?

Question

Possible Duplicate:
Parsing HTML in Python

I have a long string of HTML similar to the following:

<ul>
<li><a href="/a/long/link">Class1</a></li>
<li><a href="/another/link">Class2</a></li>
<li><img src="/image/location" border="0">Class3</a></li>
</ul>

It has several list entries (Class1 to Class8). I'd like to turn this into a list in Python with only the class names, as in

["Class1", "Class2", "Class3"]

and so on.

How would I go about doing this? I've tried using REs, but I haven't been able to find a method that works. Of course, with only 8 classes I could easily do it manually, but I have several more HTML documents to extract data from.

Thanks! :)

Check out the documentation for http://docs.python.org/library/htmlparser.html — Alex Churchill, Aug 09 '11 at 21:20
http://stackoverflow.com/questions/3276040/how-can-i-use-the-python-htmlparser-library-to-extract-data-from-a-specific-div-t if you want an example of HTMLParser — Alex Churchill, Aug 09 '11 at 21:23
Try BeautifulSoup: `soup = BeautifulSoup(html); soup.findAll("li", text=True);`, it should return the list entries holding the class names. — kenorb, Jun 22 '14 at 14:19
See also [Only extracting text from this element, not its children](http://stackoverflow.com/questions/4995116/only-extracting-text-from-this-element-not-its-children). — kenorb, Jun 22 '14 at 14:21
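
For reference, the HTMLParser suggestion from the comments could be sketched roughly as follows. This is only a minimal sketch, assuming Python 3 (where the module is `html.parser`); the class name `ListItemCollector` and the variable names are made up for illustration and are not from the linked posts.

```python
from html.parser import HTMLParser


class ListItemCollector(HTMLParser):
    """Collects the visible text of every <li> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.items = []      # finished list entries
        self._parts = None   # text chunks of the <li> currently open, else None

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._parts = []

    def handle_data(self, data):
        # Only keep text that appears while an <li> is open
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "li" and self._parts is not None:
            self.items.append("".join(self._parts).strip())
            self._parts = None


html = """<ul>
<li><a href="/a/long/link">Class1</a></li>
<li><a href="/another/link">Class2</a></li>
<li><img src="/image/location" border="0">Class3</a></li>
</ul>"""

collector = ListItemCollector()
collector.feed(html)
collector.close()
class_names = collector.items
print(class_names)
```

Because the parser only reacts to the `li` tags and ignores everything else, the stray `</a>` in the third entry does not get in the way here.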

score 2 · Answer 1 · answered Aug 09 '11 at 21:43

Check out lxml (pip install lxml). You'll want to do a little more research, but effectively it comes down to something like this:

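from lxml import etree

# page_source is assumed to hold the HTML document as a string
tree = etree.HTML(page_source)

def parse_list(xpath):
    # xpath() returns a list of matches, so take the first <ul>
    ul = tree.xpath(xpath)[0]
    # itertext() also picks up text nested inside <a> or following <img>
    return ["".join(li.itertext()).strip() for li in ul.findall("li")]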

score 0 · Answer 2 · answered Aug 09 '11 at 21:22

If all the list lines end the same way (here with </a></li>), you could try a regular expression like

re.compile(r'^<li><.*>(.*)</a></li>$')

If you're expecting much more variability in the file than in your example, then something like an HTML parser would probably be better.
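
As a rough illustration of how that pattern could be applied to the whole string at once, here is a small sketch. It assumes each entry sits on its own line, and it adds the `re.MULTILINE` flag (not part of the pattern as posted) so that `^` and `$` match at every line boundary; the variable names are only for illustration.

```python
import re

html = """<ul>
<li><a href="/a/long/link">Class1</a></li>
<li><a href="/another/link">Class2</a></li>
<li><img src="/image/location" border="0">Class3</a></li>
</ul>"""

# re.MULTILINE lets ^ and $ match at the start and end of each line
pattern = re.compile(r'^<li><.*>(.*)</a></li>$', re.MULTILINE)

# With a single capture group, findall() returns just the captured text
names = pattern.findall(html)
print(names)
```

Lines that do not fit the pattern, such as the `<ul>` and `</ul>` lines, are skipped silently, which is also why this approach breaks as soon as the markup varies.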

score 0 · Accepted Answer · answered Aug 09 '11 at 21:42

This should work, but treat it as a quick and ugly hack; do not parse HTML with regular expressions.

>>> hdata = """<ul>
... <li><a href="/a/long/link">Class1</a></li>
... <li><a href="/another/link">Class2</a></li>
... <li><img src="/image/location" border="0">Class3</a></li>
... </ul>"""
>>> import re
>>> lire = re.compile(r'<li>.*?>(.*?)<.*')
>>> [lire.search(x).groups()[0] for x in hdata.splitlines() if lire.search(x)]
['Class1', 'Class2', 'Class3']

You could try ElementTree if your source is valid XML; otherwise, look at Beautiful Soup.
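
A minimal sketch of the ElementTree route, assuming the markup really is well-formed XML. The sample from the question is not (the `<img>` is never closed and there is a stray `</a>`), so the snippet below uses a cleaned-up variant of it; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Well-formed variant of the list: <img/> is self-closed, no stray </a>
xml_source = """<ul>
<li><a href="/a/long/link">Class1</a></li>
<li><a href="/another/link">Class2</a></li>
<li><img src="/image/location" border="0"/>Class3</li>
</ul>"""

root = ET.fromstring(xml_source)

# itertext() yields the text of nested elements and the text following them,
# so it covers both the <a>Class1</a> case and the <img/>Class3 case
class_names = ["".join(li.itertext()).strip() for li in root.iter("li")]
print(class_names)
```

Feeding the original, malformed snippet to `ET.fromstring` would raise a parse error instead, which is the situation where a more forgiving HTML parser is the better fit.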

Thanks! I actually did use Beautiful Soup to isolate the list from the rest of the HTML document, but wasn't sure how to go further than that. I'll have a look at it :) — , Aug 10 '11 at 15:49
