
I have been using HTMLParser to scrape data from websites, stripping the HTML markup as I go. I'm aware of third-party modules such as Beautiful Soup, but decided not to depend on "outside" modules. Here is the code supplied by Eloff in "Strip HTML from strings in Python":

from HTMLParser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

It works in Python 3.1. However, I recently upgraded to Python 3.2.x and have found I get errors regarding the HTML Parser code as written above.

My first error points to the line:

s.feed(html)

... and the error says ...

AttributeError: 'MLStripper' object has no attribute 'strict'

So, after a bit of research, I added "strict=True" to the top line, making it...

class MLStripper(HTMLParser, strict=True)

However, I get the new error of:

TypeError: type() takes 1 or 3 arguments

To see what would happen, I removed the "self" argument and left in the "strict=True"... which gave the error:

NameError: global name 'self' is not defined

... and I got the "I'm guessing on guesses" feeling.

I have no idea what the third argument in the class MLStripper(HTMLParser) line would be, after self and strict=True; research didn't turn up any enlightenment.

MilesNielsen
1 Answer


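You're subclassing HTMLParser, but you aren't calling its __init__ method, so the attributes it sets up (such as strict, which the parser reads during feed()) are never created. You need to add one line to your __init__ method: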

def __init__(self):
    super().__init__()
    self.reset()
    self.fed = []

Also, for Python 3, the import line is:

from html.parser import HTMLParser

With these changes, a simple example works. Don't change the class line; it is unrelated to the error.
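Putting the two changes together, a minimal sketch of the whole stripper might look like the following. It assumes Python 3 and keeps the names from the question; the close() call and the sample string are additions for illustration, not part of the original snippet.

```python
from html.parser import HTMLParser


class MLStripper(HTMLParser):
    """Collects only the text between tags."""

    def __init__(self):
        # Parser options are passed here, to the base initializer,
        # not on the class line.
        super().__init__()
        # The base __init__ already calls self.reset(), so repeating
        # that call is harmless but not required.
        self.fed = []

    def handle_data(self, d):
        # Called by the parser for each run of text between tags.
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)


def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    s.close()  # added precaution: flush any text still buffered
    return s.get_data()


print(strip_tags('<p>Hello <b>world</b></p>'))  # Hello world
```

Keyword arguments written on the class line are handed to the class-creation machinery rather than to HTMLParser.__init__, which is why the strict=True attempt fails with a TypeError instead of configuring the parser.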

Thomas K
  • That worked perfectly, Thomas K. Thank you very much! Scripts are working perfectly once again with that "super().__init__()" code inserted. – MilesNielsen Jun 16 '12 at 22:22
  • This also resolves the error AttributeError: 'HTMLTagRemover' object has no attribute 'convert_charrefs'. super().__init__() was NOT required in Python 2 for me but was in Python 3. Thanks. – Simon Melouah May 19 '17 at 09:04
  • I would change the last line of the get_data function to return ' '.join(self.fed), with a space rather than an empty string. Otherwise

    Foo

    Bar

    turns into "FooBar" rather than "Foo Bar". This turns out to be a pretty common case in HTML in the wild.
    – Craig Schmidt Mar 30 '22 at 19:07
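The separator suggestion in the last comment can be tried with a small sketch like the one below. The class name TextCollector, the sep parameter, the strip_tags_normalized helper, and the sample markup are all hypothetical; the sample assumes "Foo" and "Bar" sat in adjacent paragraph elements, which the comment does not actually show.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):  # hypothetical name for this sketch
    def __init__(self):
        super().__init__()
        self.fed = []

    def handle_data(self, d):
        self.fed.append(d)


def strip_tags(html, sep=''):
    """Strip tags, joining the text chunks with `sep`."""
    s = TextCollector()
    s.feed(html)
    s.close()
    return sep.join(s.fed)


def strip_tags_normalized(html):
    # Join with spaces, then collapse runs of whitespace into one space
    # so that "Hello " + " " + "world" does not leave a double space.
    return ' '.join(strip_tags(html, sep=' ').split())


sample = '<p>Foo</p><p>Bar</p>'  # assumed input, not from the comment
print(strip_tags(sample))           # FooBar
print(strip_tags(sample, sep=' '))  # Foo Bar
print(strip_tags_normalized('<p>Hello <b>world</b></p>'))  # Hello world
```

One trade-off to keep in mind: joining with a space also separates words that are interrupted by inline tags, so 'un<b>believ</b>able' would come out as "un believ able". Which separator is preferable probably depends on the markup being scraped.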