I'm working with an ugly, ugly text file that may contain all kinds of ASCII or Unicode characters. I just want to strip those characters out and replace certain words in each line (for example, "Mr." becomes "Mister"). It's a basic cleaning script that needs to handle goofy text. I can handle the regex just fine; it's looping over the file handle that is blowing up on me.

I'm simply doing:

# this creates the file handle just fine
filehandle = open("test.txt", "r")
# the next line raises the error
for line in filehandle:
    ...  # do stuff

'charmap' codec can't decode byte 0x81 in position 34: character maps to <undefined>

I've tried:

filehandle = open("test.txt", "r", encoding="utf-8")

And it still fails. What simple thing am I missing here? Since I can create the file handle but not loop over it, do I need to iterate differently? If I test this with a typical text file containing just 'normal' words, it works fine. It also works if the file just has, say, a u with an umlaut. The problems really seem to start once the text goes into the extended ASCII range. Any and all help is greatly appreciated!

Here is a link to the text file, if that is helpful: http://s000.tinyupload.com/download.php?file_id=16749469386677650703&t=1674946938667765070359961

  • Take a look at this question: http://stackoverflow.com/questions/3284827/python-3-chokes-on-cp-1252-ansi-reading Your file probably uses the ISO 8859-1 encoding. – Casimir et Hippolyte May 12 '17 at 14:03
  • You da man! That did the trick. Thank you so very much. Should I write that up as an answer and accept it? I'm not sure what the best way to make this post useful to others is/give you credit for the great answer. – sniperd May 12 '17 at 14:15
  • It seems that your file uses ISO-8859-15 (if gedit is to be believed). I will not post an answer since your question is a duplicate of the question I linked. – Casimir et Hippolyte May 12 '17 at 14:16
  • Great! In any event, thank you for your help. :) – sniperd May 12 '17 at 15:46
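
For anyone landing here later, the fix from the comments can be sketched roughly as below. This is a minimal sketch, not the asker's actual script: the helper name `read_lines`, the sample bytes, and the temporary file are all made up for illustration, and ISO-8859-15 is only the encoding guessed in the comments (`latin-1` would also decode every byte). The 'charmap' message usually shows up when Python falls back to a platform default such as cp1252, where byte 0x81 has no mapping.

```python
import os
import tempfile


def read_lines(path, encoding="iso-8859-15"):
    """Return the file's lines decoded with an explicit encoding (hypothetical helper)."""
    with open(path, "r", encoding=encoding) as filehandle:
        return [line.rstrip("\n") for line in filehandle]


# Assumed sample data: an umlaut (0xFC) plus the byte 0x81 from the error message.
sample = b"Mr. M\xfcller\nodd byte: \x81\n"
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(sample)
    sample_path = tmp.name

# Decoding succeeds because every byte value is defined in ISO-8859-15.
lines = read_lines(sample_path)

# The kind of word replacement described in the question.
cleaned = [line.replace("Mr.", "Mister") for line in lines]

os.remove(sample_path)
```

If the real encoding is unknown, passing `errors="replace"` to `open` is another option, at the cost of turning undecodable bytes into replacement characters.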

0 Answers