I'm using the HTMLParser class from html.parser to get the data out of a collection of HTML files. It goes pretty well until a file comes along that throws an error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8a in position 419: invalid start byte
My code goes as follows:
import re
import string
from collections import Counter
from html.parser import HTMLParser

class customHTML(HTMLParser):
    # Parses the data found between tags
    def handle_data(self, data):
        data = data.strip()
        if data:
            splitData = data.split()
            # Remove punctuation!
            for i in range(len(splitData)):
                splitData[i] = re.sub('[%s]' % re.escape(string.punctuation), '', splitData[i])
            newCounter = Counter(splitData)
            # Add this chunk's word counts to the global tally
            global wordList
            wordList += newCounter
.
.
.
This is in main:
for aFile in os.listdir(inputDirectory):
    if aFile.endswith(".html"):
        parser = customHTML(strict=False)
        infile = open(inputDirectory+"/"+aFile)
        for line in infile:
            parser.feed(line)
The parser.feed(line) call, though, is where everything breaks. It's always the same UnicodeDecodeError. I have no control over what the HTML files contain, so I need a way to read them that lets me feed their contents into the parser regardless of the bytes they hold. Any ideas?
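
For illustration, here is a minimal sketch of one thing that might be worth trying. It assumes the failure comes from open() decoding the file as UTF-8 while the lines are read, rather than from the parser itself. The helper name feed_file is hypothetical, and errors="replace" is only one possible policy: it turns each undecodable byte into U+FFFD instead of raising.

```python
from html.parser import HTMLParser


def feed_file(parser: HTMLParser, path: str, encoding: str = "utf-8") -> None:
    """Hypothetical helper: feed a file to the parser without raising on bad bytes."""
    # errors="replace" substitutes U+FFFD for any byte that cannot be decoded,
    # so a stray byte such as 0x8a no longer raises UnicodeDecodeError.
    with open(path, encoding=encoding, errors="replace") as infile:
        for line in infile:
            parser.feed(line)
    # Flush any text the parser is still buffering at end of input.
    parser.close()
```

If the real encoding of the files is known, passing it as the encoding argument would preserve the original characters instead of replacing them.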