
I am using HTMLParser to extract an image url from a simple html text like this:

html = '<p><span style="font-size: 17px;"><span style="color: #993300;"><img style="margin-right: 15px; vertical-align: top;" src="images/announcements.png" alt="announcements" /><cite>some message I would like to preserve with its formatting</cite></span></span></p>'

Now I also need a version of the above html without the img tag, but am having difficulty with closing tags in the right spot. Here is what I tried:

from HTMLParser import HTMLParser

class MyHtmlParser(HTMLParser):
    '''
    Parse simple html to extract data and image url.
    This is expecting simple html containing only one data block and one image url.
    '''
    def __init__(self):
        HTMLParser.__init__(self)
        self.noImgHtml = ''

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            for a in attrs:
                if a[0] == 'src':
                    self.imageUrl = a[1]
        else:
            print '<%s>' % tag
            self.noImgHtml += '<%s>' % tag
            for a in attrs:
                print '%s=%s' % a
                self.noImgHtml += '%s=%s' % a

    def handle_endtag(self, tag):
        self.noImgHtml += '</%s>' % tag

    def handle_data(self, data):
        self.noImgHtml += data

The output of MyHtmlParser().feed(html) is this:

<b>LATEST NEWS:</b><p><span>style=font-size: 17px;<span>style=color: #993300;</img><cite>The image uploader works again, so make sure to use some screenshots in your uploads/tutorials to make your submission look extra nice</cite></span></span></p>

As you can see (and as is expected from my code flow), the start tags aren't reconstructed the way they were in the original html (e.g. the attributes end up after <span> instead of inside it).

Can this be done easily with HTMLParser, or should I resort to regular expressions to extract the image tag (which doesn't seem very elegant)?

I can't use external modules for this, so I need to make do with what HTMLParser has to offer.

Thanks in advance, frank

Frank Rueter

2 Answers

0

In fact, your code mostly works. After

parser = MyHtmlParser()
parser.feed(html)
parser.noImgHtml

parser.noImgHtml holds the string you want. I tried it, and the output is

<p><span>style=font-size: 17px;<span>style=color: #993300;</img><cite>some message I would like to preserve with its formatting</cite></span></span></p>

except that you need to change the handle_endtag function to

def handle_endtag(self, tag):
    if tag != 'img':
        self.noImgHtml += '</%s>' % tag

to exclude the endtag of img.

Note that MyHtmlParser().feed(html) only prints the result; it returns nothing. The tags do not look properly closed in the printed output because handle_endtag and handle_data do not print the end tags or the tag content; they only append them to noImgHtml.
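To illustrate why a stray </img> shows up and why feed() gives nothing back, here is a minimal sketch. It assumes Python 3 (where the import is html.parser rather than the HTMLParser module used in the question), and EventLogger is a hypothetical helper that is not part of the original code. It only records which handler fires for each tag.

```python
from html.parser import HTMLParser  # Python 3; on Python 2: from HTMLParser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: records which handler is called for each tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))


logger = EventLogger()
result = logger.feed('<p><img src="images/announcements.png" /></p>')

print(result)         # feed() has no return value
print(logger.events)  # the self-closing img fires both handlers
```

The default handle_startendtag() calls handle_starttag() followed by handle_endtag(), which is why the end-tag handler has to filter out img explicitly.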

If you are trying to deal with nested divs, Alex's answer to this question may be helpful: How can I use the python HTMLParser library to extract data from a specific div tag?

flyingfoxlee
  • thanks, but this would still not leave the tags with attributes formatted properly, right? I just stumbled over HTMLParser.get_starttag_text() which seems to be what I need to reconstruct the original html – Frank Rueter Nov 13 '13 at 03:13
  • I see the problem here, you may also tweak the `handle_starttag` method, in the `else` part, add ``. – flyingfoxlee Nov 13 '13 at 03:44
  • But obviously `get_starttag_text` should be used, since we don't need to reinvent the wheel. – flyingfoxlee Nov 13 '13 at 03:50
0

HTMLParser.get_starttag_text() seems to be the ticket to reconstruct the original html. This seems to work:

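from HTMLParser import HTMLParser  # Python 2; on Python 3: from html.parser import HTMLParser

class MyHtmlParser(HTMLParser):
    '''
    Parse simple html to extract data and image url.
    This is expecting simple html containing only one data block and one image url.
    '''
    def __init__(self):
        HTMLParser.__init__(self)
        self.noImgHtml = ''
        self.imageUrl = None   # set when an img tag with a src attribute is seen
        self.text = ''         # last data block seen

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            # keep the image url, but leave the tag out of noImgHtml
            for a in attrs:
                if a[0] == 'src':
                    self.imageUrl = a[1]
        else:
            # get_starttag_text() returns the start tag as it appeared in the source
            self.noImgHtml += self.get_starttag_text()

    def handle_endtag(self, tag):
        # the self-closing <img ... /> also triggers handle_endtag, so skip it here
        if tag != 'img':
            self.noImgHtml += '</%s>' % tag

    def handle_data(self, data):
        self.noImgHtml += data
        self.text = data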
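
For reference, here is a self-contained sketch of the same get_starttag_text() approach run against the html from the question. It assumes Python 3 (html.parser instead of the Python 2 HTMLParser module); the names ImgStripper, no_img_html, image_url and img_tag are made up for this illustration. One caveat: with Python 3's default convert_charrefs=True, character references in the text are unescaped before handle_data() sees them, so text containing entities would not round-trip verbatim.

```python
from html.parser import HTMLParser  # Python 3; on Python 2: from HTMLParser import HTMLParser


class ImgStripper(HTMLParser):
    """Hypothetical name: rebuilds the html without img tags and keeps the img src."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.no_img_html = ''
        self.image_url = None

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            # attrs is a list of (name, value) pairs
            self.image_url = dict(attrs).get('src')
        else:
            # original text of the start tag, attributes included
            self.no_img_html += self.get_starttag_text()

    def handle_endtag(self, tag):
        # also called for the self-closing <img ... />, so filter it out
        if tag != 'img':
            self.no_img_html += '</%s>' % tag

    def handle_data(self, data):
        self.no_img_html += data


img_tag = ('<img style="margin-right: 15px; vertical-align: top;" '
           'src="images/announcements.png" alt="announcements" />')
html = ('<p><span style="font-size: 17px;"><span style="color: #993300;">'
        + img_tag
        + '<cite>some message I would like to preserve with its formatting</cite>'
        '</span></span></p>')

parser = ImgStripper()
parser.feed(html)
parser.close()  # flush anything still buffered

print(parser.image_url)
print(parser.no_img_html)
```

For this input the rebuilt string should match the original html with the img tag removed, because start tags are copied verbatim and the end tags were already lowercase.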
Frank Rueter