I'm working with an ugly, ugly text file that may contain all kinds of ASCII or Unicode characters. I'm just looking to strip those out and replace certain words in each line (for example, "Mr." becomes "Mister"). It's a basic cleaning script that needs to handle goofy text. I can handle the regex just fine; it's the loop over the file handle that is blowing up on me.
I'm simply doing:
# this creates the file handle just fine
filehandle = open("test.txt", "r")
# the next line raises the error
for line in filehandle:
    ...  # do stuff
'charmap' codec can't decode byte 0x81 in position 34: character maps to <undefined>
I've tried:
filehandle = open("test.txt", "r", encoding="utf-8")
And it still fails. What simple thing am I missing here? Since I can create the file handle but not loop over it, do I need to iterate differently? If I test this with a typical text file containing just 'normal' words, it works fine. It also works if the file has, say, a u with an umlaut. It really seems to be once the ASCII gets extended that I have problems. Any and all help is greatly appreciated!
Here is a link to the text file, if that is helpful: http://s000.tinyupload.com/download.php?file_id=16749469386677650703&t=1674946938667765070359961
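In case it helps anyone reproduce this, here is a minimal sketch of what I think is going on. It is not my real script. It assumes the platform default encoding is cp1252 (my guess from the 'charmap' wording in the error), and the sample bytes and the `try_read` helper are made up for illustration. It shows that `open()` itself succeeds and the decode error only appears once the lines are actually read, and it tries the `errors="replace"` argument of `open()` as one possible, lossy workaround.

```python
import os
import tempfile

# Hypothetical sample data: byte 0x81 has no mapping in cp1252
# and is not valid on its own in UTF-8 either.
raw = b"plain line\nbad byte here: \x81\n"


def try_read(path, **open_kwargs):
    """Open the file, iterate over its lines, and report what happened."""
    handle = open(path, "r", **open_kwargs)  # succeeds: nothing is decoded yet
    try:
        lines = [line for line in handle]  # decoding happens here
        return ("ok", lines)
    except UnicodeDecodeError as exc:
        return ("decode error", str(exc))
    finally:
        handle.close()


# Write the sample bytes to a temporary file so the sketch is self-contained.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(raw)
    sample_path = tmp.name

cp1252_result = try_read(sample_path, encoding="cp1252")
utf8_result = try_read(sample_path, encoding="utf-8")
# errors="replace" swaps each undecodable byte for U+FFFD instead of raising.
replaced_result = try_read(sample_path, encoding="utf-8", errors="replace")

os.remove(sample_path)

print(cp1252_result)
print(utf8_result)
print(replaced_result)
```

With `errors="replace"` the loop runs to completion, but the bad byte is replaced rather than recovered, so this only makes sense if those characters are going to be stripped out anyway.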