I am trying to read a file that contains the character "ë". The problem is that I cannot figure out how to read it, no matter what I try with the encoding. When I look at the file manually in TextEdit, it is listed as an unknown 8-bit file. If I try changing it to UTF-8, UTF-16 or anything else, it either does not work or messes up the entire file. I tried reading the file with the standard Python commands as well as with codecs, and cannot come up with anything that reads it correctly. I will include a code sample of the read below. Does anyone have any clue what I am doing wrong? This is Python 2.7.10, by the way.

import codecs
readFile = codecs.open("FileName", encoding='utf-8')

The line I am trying to read is the following, with nothing else in it.

Aeëtes

Here are some of the errors I get:

UnicodeDecodeError: 'utf8' codec can't decode byte 0x91 in position 0: invalid start byte

UnicodeError: UTF-16 stream does not start with BOM (I know this one just means that it is not a UTF-16 file.)

UnicodeDecodeError: 'ascii' codec can't decode byte 0x91 in position 0: ordinal not in range(128)

If I don't use a codec, the word comes in as Ae?tes, which then crashes later in the program. Just to be clear, none of the suggested questions, nor anything else I could find on the net, pointed to an answer. One other detail that might help: I am using OS X, not Windows.

Jongware
Jimmy
  • Can you paste the file, or at least some part of it? What's the error you're getting? – ffledgling Sep 09 '16 at 16:27
  • Please provide some error or unexpected results. There is also the `"utf-8-sig"` encoding, which might help. – C.LECLERC Sep 09 '16 at 16:32
  • The error changes based on which encoding I use. Here is one of them. UnicodeDecodeError: 'utf8' codec can't decode byte 0x91 in position 0: invalid start byte – Jimmy Sep 09 '16 at 16:32
  • @C.LECLERC I tried the -sig variant and that gave the same results as well. Thanks though. – Jimmy Sep 09 '16 at 16:39
  • Looks like http://stackoverflow.com/questions/38019379/python-unicodedecodeerror-utf8-codec-cant-decode-byte-0x91 – C.LECLERC Sep 09 '16 at 16:40
  • Where was the file written? Was it perhaps in an environment that uses some weird encoding like Windows legacy code pages? See a similar question here: http://stackoverflow.com/q/6344853/2988730 – Mad Physicist Sep 09 '16 at 17:14
  • @MadPhysicist I am not sure where the file was written, so it may well have a weird encoding. I looked through that question and a bunch of others, but found nothing that would fix my issue. Thanks – Jimmy Sep 09 '16 at 18:10
  • Have you tried constructing a list of possible encodings and looping through them until one works? – Mad Physicist Sep 09 '16 at 18:30
  • So I searched for an encoding where `0x91` represents the character `ë`. As for your "One More Thing" note: imagine my surprise when this turned out to be so in [Mac Roman Encoding](https://en.wikipedia.org/wiki/Mac_OS_Roman). – Jongware Sep 09 '16 at 20:15
  • @RadLexus if you want to post that as an answer I will mark it. When I went through all the different encodings I specifically left off the Mac encodings, since I was able to find out that it was a Windows file. I guess my computer must have changed it. Thanks – Jimmy Sep 09 '16 at 22:16
  • If you know what specific codec to use (I don't), feel free to answer it yourself. I.e., I know what the encoding is but cannot say what it'd be in Python. – Jongware Sep 09 '16 at 22:20

1 Answer

Credit for this answer goes to RadLexus for figuring out the proper encoding, and also to Mad Physicist, who put me on the right track even though I did not consider all possible encodings.

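The issue, apparently, is that the file is encoded as Mac OS Roman (`mac_roman` in Python), where byte 0x91 is ë; my Mac seems to have converted the .txt file to that encoding. If you read the file with that encoding, it works perfectly.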
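As a quick sanity check of that claim, here is a minimal sketch that decodes the sample word at the byte level. It assumes the line is stored as the bytes `Ae\x91tes` (the variable names are only illustrative) and should run on both Python 2.7 and Python 3.

```python
# Sample line as raw bytes, assuming 0x91 stands for the accented letter.
raw = b"Ae\x91tes"

# Mac OS Roman maps byte 0x91 to U+00EB (e with diaeresis).
text = raw.decode("mac_roman")

# UTF-8 rejects the data: 0x91 is a continuation byte and cannot start a sequence.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Windows-1252 decodes without error, but 0x91 becomes U+2018 (a left curly quote).
cp1252_text = raw.decode("cp1252")
```

The last case is worth noting: a single-byte codec can "work" in the sense of raising no error while still producing the wrong character.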

This is the line of code that I used to read it.

import codecs
readFile = codecs.open("FileName", encoding='mac_roman')
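For anyone who does not know the encoding up front, the suggestion from the comments (loop over candidate encodings) could be sketched roughly as below. The function name `read_with_first_matching` and its parameters are hypothetical, not part of any library. Because single-byte codecs such as `latin-1` accept every byte, the sketch also checks that the decoded text contains a character you expect, rather than stopping at the first codec that raises no error.

```python
import io

def read_with_first_matching(path, candidates, expected):
    """Return (encoding, text) for the first candidate encoding that decodes
    the file and yields text containing `expected`; (None, None) otherwise.
    Hypothetical helper for illustration only."""
    # Read the raw bytes once, then try each codec on the same data.
    with io.open(path, "rb") as handle:
        raw = handle.read()
    for name in candidates:
        try:
            text = raw.decode(name)
        except UnicodeDecodeError:
            continue  # this codec cannot decode the bytes at all
        if expected in text:
            return name, text  # decoded cleanly and contains the expected character
    return None, None
```

Calling it with something like `read_with_first_matching("FileName", ["utf-8", "cp1252", "mac_roman"], u"\xeb")` should select `mac_roman` for a file like the one in the question, assuming the file really contains byte 0x91 where the ë belongs.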
Jongware
Jimmy