UnicodeDecodeError: Python HTML Parsing

Question

I'm using the HTMLParser class from html.parser to get the data out of a collection of HTML files. It goes pretty well until a file comes along that throws an error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8a in position 419: invalid start byte

My code goes as follows:

class customHTML(HTMLParser):
    # Parses the data found
    def handle_data(self, data):
        data = data.strip()
        if(data):
            splitData = data.split()
            # Remove punctuation!
            for i in range(len(splitData)):
                splitData[i] = re.sub('[%s]' % re.escape(string.punctuation), '', splitData[i])
            newCounter = Counter(splitData)
            global wordList
            wordList += newCounter

This is in main:

for aFile in os.listdir(inputDirectory):
    if aFile.endswith(".html"):     
        parser = customHTML(strict=False)
        infile = open(inputDirectory+"/"+aFile)
        for line in infile:
            parser.feed(line)

The parser.feed(line) call, though, is where everything breaks. It's always the same UnicodeDecodeError. I have no control over what the HTML files contain, so I need a way to get them into the parser regardless. Any ideas?

score 0 · Answer 1 · edited May 23 '17 at 11:49

This is a relatively common problem, and there are quite a few SO threads about it. Check out this one: Determine the encoding of text in Python
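As a rough illustration of working around the error with the standard library only, here is a minimal sketch. It assumes the files are either UTF-8 or a Windows-1252-style single-byte encoding; that is a guess based on byte 0x8a being valid in cp1252, not something the question confirms. The helper name read_html_text and the list of candidate encodings are hypothetical. The idea is to read the raw bytes and decode them explicitly, because a file opened in text mode is decoded with the platform default encoding as it is read.

```python
def read_html_text(path, encodings=("utf-8", "cp1252")):
    """Read a file as bytes and decode it with the first encoding that works.

    The candidate encodings are an assumption; adjust them to your data.
    """
    with open(path, "rb") as infile:
        raw = infile.read()
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: latin-1 maps every byte to a character, so it never fails,
    # although the resulting characters may not be the intended ones.
    return raw.decode("latin-1")
```

In the loop from the question, this would be used as parser.feed(read_html_text(os.path.join(inputDirectory, aFile))). If losing a few characters is acceptable, open(path, encoding="utf-8", errors="replace") is a simpler alternative that substitutes undecodable bytes with U+FFFD.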

I'd like to take a moment to comment on your code as well.

Python does not need parentheses around conditionals. Use

if foo:
    action()

not

if (foo):
    action()

You should declare the global once at the top of the function/method, not every time through the loop.

This code:

for i in range(len(splitData)):
    splitData[i] = re.sub('[%s]' % re.escape(string.punctuation), '', splitData[i])

is better written as

for i, data in enumerate(splitData):
    splitData[i] = re.sub('[%s]' % re.escape(string.punctuation), '', data)

or as

splitData = [ re.sub('[%s]' % re.escape(string.punctuation), '', data) 
              for data in splitData ]
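Putting these suggestions together, the word-counting step could be sketched as a small standalone function. The names PUNCTUATION_RE and count_words are hypothetical, and two things go beyond the original code: the pattern is compiled once rather than rebuilt for every word, and tokens that consisted only of punctuation are dropped instead of being counted as empty strings.

```python
import re
import string
from collections import Counter

# Compile the pattern once instead of rebuilding it for every word.
PUNCTUATION_RE = re.compile('[%s]' % re.escape(string.punctuation))


def count_words(data):
    """Count the whitespace-separated words in data after stripping punctuation."""
    words = [PUNCTUATION_RE.sub('', word) for word in data.split()]
    # Assumption: tokens that were pure punctuation (now empty) are not wanted.
    return Counter(word for word in words if word)
```

Inside handle_data, the result could then be added to the running total, for example wordList += count_words(data).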

score 0 · Answer 2 · answered Feb 11 '14 at 02:32

While subclassing HTMLParser might be a good exercise, if your HTML isn't UTF-8 I'd advise using the BeautifulSoup parser, which is quite good at detecting encodings automatically.

Ryne Everett
