
I've been searching the web for a way to read files with different encodings, and I've found many instances of "it's impossible to tell what encoding a file is" (so if anyone is reading this and has a link, I would appreciate it). However, the problem I was dealing with was narrower than "open a file in any encoding": I only needed to open files from a set of known encodings. I am by no means an expert on this topic, but I thought I would post my solution in case anyone runs into this issue.

Specific example:

Known file encodings: UTF-8 and Windows ANSI

Initial issue: as I now know, not specifying an encoding in Python's open('file', 'r') falls back to a platform-dependent default, which in my case was encoding='utf8'. That raised a UnicodeDecodeError at runtime when trying to f.readline() an ANSI file. A common search for this is: "UnicodeDecodeError: 'utf-8' codec can't decode byte"

Secondary issue: so then I thought, okay, simple enough: we know which exception is being raised, so read a line, and if it raises UnicodeDecodeError, close the file and reopen it with open('file', 'r', encoding='ansi'). The problem was that sometimes utf8 could read the first few lines of an ANSI-encoded file just fine (lines containing only ASCII characters are identical in both encodings) but then failed on a later line. Now the solution became clear: I had to read through the entire file with utf8, and if that failed, I knew the file was ANSI.
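
To illustrate why checking only the first line is not enough, here is a minimal sketch. It assumes "ansi" means code page 1252 (an assumption; the actual code page depends on the Windows locale), and the sample text is made up:

```python
# Assumption: "ansi" here means Windows code page 1252.
sample = "plain ascii line\nsecond line\ncafé\n".encode("cp1252")
lines = sample.split(b"\n")

# ASCII-only lines are byte-for-byte identical in UTF-8 and cp1252,
# so they decode as UTF-8 without any error.
first_ok = lines[0].decode("utf-8")

# The first non-ASCII byte (0xE9 for "é" in cp1252) is where UTF-8 fails.
try:
    lines[2].decode("utf-8")
    failed = False
except UnicodeDecodeError:
    failed = True
```

So a file can look like valid UTF-8 for any number of lines before the first non-ASCII character shows up.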

I'll post my take on this as an answer but if someone has a better solution, I would also appreciate that :)

snakecharmerb
PydPiper
  • "Windows ANSI" is a misnomer, and not entirely well-defined. You probably mean code page 1252. – tripleee Apr 06 '19 at 17:04
  • There are many questions about resolving this within a file with mixed encodings. Falling back to Latin-1 (or CP1252 or what have you) when you get a decoding error is a well-known and fairly trivial technique if you understand how UTF-8 was designed. – tripleee Apr 06 '19 at 17:11
  • Possible duplicate of [Python - dealing with mixed-encoding files](https://stackoverflow.com/questions/10009753/python-dealing-with-mixed-encoding-files) – tripleee Apr 06 '19 at 17:12
  • Hi tripleee, thanks for the response. I took a look at your link, and it appears to me to be a fix for a specific set of characters. In my case I don't know which characters are used in the files. The answer there also seems a bit more advanced, and the one I posted seems simpler for a beginner who might not understand why their usual file read doesn't work – PydPiper Apr 07 '19 at 05:03
  • How did you come to the point of having text files (bytes) without the essential knowledge of which encoding each uses, yet you do know each is either UTF-8 or "Windows ANSI" (whatever that means)? Can you ask for them again? (If you receive them through HTTP, the response header might say which encoding is used.) – Tom Blodget Apr 07 '19 at 23:27
  • Hey Tom, I didn't know I was dealing with multiple encodings until I got a codec error out of nowhere. At first I didn't even know what it meant until I did a bit of digging online and in my code (to find which file it failed on). I pulled up the file in Notepad++ and saw that its encoding was ANSI – PydPiper Apr 07 '19 at 23:35
  • That's unfortunate. BTW, Notepad++ is guessing (and reporting it in a not very informative way if it says "ANSI"). – Tom Blodget Apr 07 '19 at 23:37
  • Yeah okay, that makes sense, since what I've read online is that there is really no way to tell what a file's encoding is, so Notepad++ would also be guessing. I wrote an API that has functionality to read in files and create a structured database out of them, and the files I've worked with have all been utf8, but apparently some of my coworkers receive whatever 'ansi' encoding is from a 3rd party. My way around this was to just read through the file with utf8 first, and if it fails, guess 'ansi'. So far it seems to work, until it doesn't :) – PydPiper Apr 07 '19 at 23:50

2 Answers

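# first pass: try to read the whole file as UTF-8
f = open(path, 'r', encoding='utf8')
while True:
    try:
        line = f.readline()
    except UnicodeDecodeError:
        # not valid UTF-8, so fall back to the other known encoding
        # ('ansi' is only available on Windows; elsewhere use e.g. 'cp1252')
        f.close()
        encoding = 'ansi'
        break
    if not line:
        # reached the end of the file without a decode error
        f.close()
        encoding = 'utf8'
        break

# now open your file for actual reading and data handling
with open(path, 'r', encoding=encoding) as f:
    ...  # process the file here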
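
The same idea can be wrapped in a function that reads the raw bytes once and tries each known encoding in order. This is a sketch, not a general encoding detector: `read_text_with_fallback` and its `encodings` parameter are hypothetical names, and `cp1252` is assumed as the stand-in for "ansi" (as suggested in the comments on the question).

```python
def read_text_with_fallback(path, encodings=("utf-8", "cp1252")):
    """Return (text, encoding) for the first encoding that decodes the whole file."""
    # Read the raw bytes once so the file is not opened once per attempt.
    with open(path, "rb") as f:
        raw = f.read()
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # try the next known encoding
    raise ValueError(f"none of {encodings!r} could decode {path!r}")
```

Order matters: UTF-8 goes first because it is strict and rejects most cp1252 text that contains non-ASCII characters, while cp1252 accepts almost any byte sequence. Note that this loads the whole file into memory and, unlike text-mode open(), does not translate line endings.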
PydPiper

If you replace the codec in the answer to the question tripleee linked with "ansi", it becomes:

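import codecs

# module-level state: position of the last byte handled (not reset between decodes)
last_position = -1

def mixed_decoder(unicode_error):
    global last_position
    # the bytes being decoded (Python 3: exceptions cannot be indexed, use .object)
    string = unicode_error.object
    position = unicode_error.start
    if position <= last_position:
        position = last_position + 1
    last_position = position
    # slice rather than index: indexing bytes yields an int in Python 3
    new_char = string[position:position + 1].decode("ansi")
    #new_char = u"_"
    return new_char, position + 1

# makes the handler available as errors="mixed"
codecs.register_error("mixed", mixed_decoder)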

Bonus: reads as UTF-8 until an error occurs and does not need in-place error handling.
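
A minimal sketch of how such a handler could be used in Python 3. To stay self-contained it registers its own stateless handler under the hypothetical name "cp1252_fallback", and it assumes cp1252 in place of "ansi" so that it also runs outside Windows:

```python
import codecs

def cp1252_fallback(error):
    """Decode the bytes that UTF-8 rejected as cp1252 instead of raising."""
    if not isinstance(error, UnicodeDecodeError):
        raise error
    bad = error.object[error.start:error.end]
    # errors="replace" covers the few byte values cp1252 leaves undefined
    return bad.decode("cp1252", errors="replace"), error.end

codecs.register_error("cp1252_fallback", cp1252_fallback)

# Made-up sample: valid UTF-8 followed by cp1252-encoded text.
mixed = "naïve ".encode("utf-8") + "café".encode("cp1252")
text = mixed.decode("utf-8", errors="cp1252_fallback")
```

The same handler name can be passed to open(), for example open(path, encoding="utf-8", errors="cp1252_fallback"), so the file is read in a single pass.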

serv-inc