I want to parse HTML in python

Question

I have this small class:

class HTMLTagStripper(HTMLParser):
    def __init__(self):
       self.reset()
       self.fed = []
    def handle_data(self, data):
       self.fed.append(data)
    def handle_starttag(self, tag, attrs):
       if tag == 'a':
           return attrs[0][1]
    def get_data(self):
       return ''.join(self.fed)

parsing this HTML code:

<div id="footer">
<p>long text.</p>
<p>click <a href="somelink.com">here</a>
</div>

This is the result I get: long text click here
but I want to get: long text click somelink.com

Is there a way to do this?

If there is the will... I know I will be shot at here for this suggestion, but if all you want to do is remove tags you can use a regex :-) — Simon Bergot, Jun 19 '12 at 13:28
[Please don't parse HTML with RegEx](http://stackoverflow.com/a/1732454/189134) Use [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/) or another library designed for it instead. — Andy, Jun 19 '12 at 13:33

Levon · Answer 1 · 2012-06-20T01:58:07.747

Take a look at BeautifulSoup; it will do that and much more.

Or you could use regular expressions/string operations to strip out the data you want. In the long run using something like BeautifulSoup will pay off, especially if you expect to do more of this.

Here's one way to use BeautifulSoup to extract the single/only link in your HTML data (I'm not an expert with this, so there may be other, better ways - suggestions/corrections welcome).

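from bs4 import BeautifulSoup  # BeautifulSoup 4 (the BeautifulSoup 3 import was `from BeautifulSoup import BeautifulSoup`)

s = """<div id="footer">
       <p>long text.</p>
       <p>click <a href="somelink.com">here</a>
       </div>"""

soup = BeautifulSoup(s, 'html.parser')
# find() returns the first <a> tag that has an href attribute
your_link = soup.find('a', href=True)['href']
print('long text click', your_link)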

will print:

long text click somelink.com

@user1307624 If this solved your problem please consider [accepting this answer by clicking on the checkmark](http://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work/5235#5235) next to my answer. It will mark this problem as solved and reward us both with some rep points. Thanks. — Levon, Jun 23 '12 at 00:51

bcelary · Answer 2 · 2012-06-19T13:52:39.460


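This will NOT work for you:

import re
x = re.compile(r'<.*?>')  # matches any tag, including its attributes
stripped = x.sub('', html)  # html is the markup from the question

because you also want to extract attribute values (such as href) from the tags, and this pattern throws the tags away entirely.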

As Levon points out: you should go for BeautifulSoup.
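To make the problem concrete, here is a small sketch that runs the tag-stripping regex against the HTML from the question. The variable names (`tag_pattern`, `flattened`) and the whitespace-collapsing step are just for illustration and are not part of the original snippet.

```python
import re

html = """<div id="footer">
<p>long text.</p>
<p>click <a href="somelink.com">here</a>
</div>"""

# Non-greedy match for everything between '<' and '>'.
tag_pattern = re.compile(r'<.*?>')
stripped = tag_pattern.sub('', html)

# Collapse newlines and repeated spaces so the result is easier to read.
flattened = ' '.join(stripped.split())
print(flattened)
```

The href value lives inside the `<a ...>` tag itself, so it disappears together with the tag and only the link text "here" survives.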

edited Jun 19 '12 at 13:52

answered Jun 19 '12 at 13:28


Ah, right. Thanks for pointing this out. I hadn't noticed that in the question. – bcelary Jun 19 '12 at 13:47

score 0 · Answer 3 · answered Jun 19 '12 at 14:28

Replacing this:

def handle_starttag(self, tag, attrs):
   if tag == 'a':
       return attrs[0][1]

With this:

def handle_starttag(self, tag, attrs):
   if tag == 'a':
       value = dict(attrs).get("href", None)
       if value:
           # add extra spaces since you don't sanitize
           # them in get_data
           self.fed.append(" %s " % value)

should more or less work, though it can still fail depending on the HTML source. That's why we have BeautifulSoup.
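For reference, a complete version of the class might look like the sketch below. It assumes Python 3 (`html.parser`), and it goes a bit beyond the replacement above: the `in_link` flag and the `handle_endtag` method are my own additions so that the link text ("here") is dropped in favour of the href, and `get_data` collapses whitespace. The `return attrs[0][1]` in the question has no effect because the parser ignores whatever the handler methods return; anything you want to keep has to be stored on the instance.

```python
from html.parser import HTMLParser


class HTMLTagStripper(HTMLParser):
    """Collect text content, substituting each link's href for its link text."""

    def __init__(self):
        super().__init__()
        self.fed = []
        self.in_link = False  # True while inside an <a> that has an href

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            # attrs is a list of (name, value) pairs; look href up by name
            # instead of relying on it being the first attribute.
            href = dict(attrs).get('href')
            if href:
                self.fed.append(href)
                self.in_link = True

    def handle_endtag(self, tag):
        if tag == 'a':
            self.in_link = False

    def handle_data(self, data):
        # Skip the visible link text, keep everything else.
        if not self.in_link:
            self.fed.append(data)

    def get_data(self):
        # Join the pieces and normalise runs of whitespace to single spaces.
        return ' '.join(''.join(self.fed).split())


stripper = HTMLTagStripper()
stripper.feed("""<div id="footer">
<p>long text.</p>
<p>click <a href="somelink.com">here</a>
</div>""")
stripper.close()
result = stripper.get_data()
print(result)
```

This keeps the period after "long text", which the desired output in the question leaves out; strip punctuation separately if you need an exact match.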

score 0 · Accepted Answer · answered Jul 19 '12 at 03:45

I was actually checking out this new HTML parser library and came up with this solution:

from htmldom import htmldom
dom = htmldom.HtmlDom().createDom( """<div id="footer">
<p>long text.</p>
<p>click <a href="somelink.com">here</a>
</div>""");
nodes = dom.find( "p" ).children( all_children = True ) # this makes all text nodes to be in the set.
for node in nodes:
    if node._is( "a" ):
        print( node.attr( "href" ).strip() )
    elif node._is( "text" ):
        print( node.getNode().text, end = '', sep = ' ' )

You can download the library from SourceForge or from the Python Package Index (HtmlDom). It works on Python 3.x. The documentation is not that good, but it is understandable. Hope you like the answer :)

You can find documentation at [Documentation](http://thehtmldom.sourceforge.net/) — coder, Jul 19 '12 at 03:47
