I've been searching the web for a way to handle reading files with different encodings, and I've found many instances of "it's impossible to tell what encoding a file is" (so if anyone reading this has a good link on that, I would appreciate it). However, the problem I was dealing with was a bit more focused than "open a file of any encoding": I only needed to open files from a small set of known encodings. I am by no means an expert on this topic, but I thought I would post my solution in case anyone else runs into this issue.
Specific example:
Known file encodings: UTF-8 and Windows ANSI
Initial Issue: as I now know, not specifying an encoding in Python's open('file', 'r') falls back to the platform's default encoding (locale.getpreferredencoding()), which on my machine meant the file was decoded as UTF-8. That raised a UnicodeDecodeError at runtime when calling f.readline() on an ANSI file. A common search on this is: "UnicodeDecodeError: 'utf-8' codec can't decode byte".
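Here's a minimal sketch that reproduces the error; the file name and contents are made up for illustration, I'm assuming cp1252 as the Windows ANSI code page, and I'm passing encoding='utf8' explicitly so the example behaves the same regardless of the platform default:

    # Write a file in the Windows ANSI code page (cp1252 assumed) with one
    # non-ASCII character, then read it back as UTF-8.
    with open('example_ansi.txt', 'w', encoding='cp1252') as f:
        f.write('café\n')   # 'é' is the single byte 0xE9 in cp1252

    with open('example_ansi.txt', 'r', encoding='utf8') as f:
        f.readline()        # UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 ...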
Secondary Issue: so then I thought, okay, simple enough: we know the exception that's being raised, so read a line, and if it raises UnicodeDecodeError, close the file and reopen it with open('file', 'r', encoding='ansi'). The problem with this was that sometimes UTF-8 was able to read the first few lines of an ANSI-encoded file just fine but then failed on a later line. At that point the solution became clear: I had to read through the entire file as UTF-8, and only if that failed did I know the file was ANSI.
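A rough sketch of that approach (the helper name read_lines is mine, and I'm again assuming cp1252 as the Windows ANSI codec; adjust to whatever your known second encoding is):

    def read_lines(path):
        # Try the whole file as UTF-8 first; only if a decode error surfaces
        # (possibly deep into the file) reopen and reread it as Windows ANSI.
        try:
            with open(path, 'r', encoding='utf8') as f:
                return f.readlines()   # forces every line to be decoded
        except UnicodeDecodeError:
            with open(path, 'r', encoding='cp1252') as f:
                return f.readlines()

The key point is that the UTF-8 attempt has to cover the whole file before you trust it, and this only works at all because the two candidate encodings are known up front.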
I'll post my take on this as an answer, but if someone has a better solution, I would also appreciate that :)