
I resolved this problem, but I'd like to understand why it happened.

I am on a Windows 10 PC running Python 3.9.6. I had a simple text file with a single line in it, which was just:

Fifty_50

I had been running a small Python utility for some time, opening files like this and parsing through the contents without any issue, but that was under Python 3.7. My code was very simple:

with open(companyfile) as companies:
    for company in companies:
        ...

When I ran this yesterday, I started getting garbage instead of the text from this simple one-line file. I decided it was likely because I wasn't providing an encoding and changed the code to:

with open(companyfile, 'r', encoding='utf-8') as companies:

That gave me this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

Finally, I tried utf-16, and the file opened and processed normally.
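
That is, the version that ended up working was simply:

with open(companyfile, 'r', encoding='utf-16') as companies:
    for company in companies:
        ...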

So my question is: do I always have to specify utf-16 now that I'm using Python 3.9? There were no special characters in the simple file I was trying to open, so I don't understand why it had a problem.

Any insight would be appreciated.

Thanks--

Al

Al G
  • Try opening the file with open(companyfile, 'rb') and view the contents. This will prevent the decoding of the bytes, so you can see what is actually in the file and check for extraneous characters (see the sketch after these comments). – belfner Sep 09 '21 at 16:56
  • I did try that; while vim saw nothing odd in the file, opening it with 'rb' did show extraneous characters. Since I created the file by hand with vim myself, I'm not sure where they came from... that was my first clue though. – Al G Sep 10 '21 at 01:54
  • Have you been able to recreate this issue with a new file? – belfner Sep 10 '21 at 02:04
  • You have to open the file in the encoding it was saved in. For whatever reason, that file was saved in UTF-16 encoding. The default for `open()` is `locale.getpreferredencoding(False)` which for US Windows is typically `cp1252`. – Mark Tolonen Sep 21 '21 at 17:35
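
A minimal sketch of the binary check belfner describes, assuming companyfile is the same path used above:

# Read the raw bytes without decoding them, so a BOM or stray bytes become visible
with open(companyfile, 'rb') as f:
    raw = f.read()

print(raw)
# A file saved as UTF-16-LE with a BOM would start with b'\xff\xfe', e.g. something like:
# b'\xff\xfeF\x00i\x00f\x00t\x00y\x00_\x005\x000\x00\r\x00\n\x00'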

1 Answer


Each code unit in UTF-16 is 16 bits, or 2 bytes, long, as the name denotes. Attempting to open it as a UTF-8 encoded file won't work because the two encodings are incompatible on a fundamental level. I think most files I use are UTF-8, but a lot of Microsoft programs (like PowerShell and Excel) will generate text documents in UTF-16 by default instead.

In terms of "guessing" the encoding, there isn't really a "right" way to do it. There's no universal byte sequence in any file that designates what encoding was used, because encodings are rather arbitrary, and a new one could be designed at any time.
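
To make the mismatch concrete, here is roughly what happens when UTF-16 bytes are handed to the UTF-8 decoder (using the same string as in the question; the exact BOM bytes assume a little-endian machine, which Windows is):

data = 'Fifty_50'.encode('utf-16')
print(data)           # b'\xff\xfeF\x00i\x00f\x00t\x00y\x00_\x005\x000\x00'
data.decode('utf-8')  # UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff
                      # in position 0: invalid start byte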

David Culbreth
  • That's not entirely true. Or at least there's nuance. There is the [byte order mark](https://en.wikipedia.org/wiki/Byte_order_mark), required in UTF-16 and UTF-32 and optional in UTF-8 (the standard specifies that it is allowed, though many \*nix systems don't handle UTF-8 BOMs correctly). However, if you know that a file is encoded with *a* UTF encoding, you can determine which type and which endianness. If the initial byte sequence matches a known UTF BOM, it's not unreasonable to guess that encoding. – Bacon Bits Sep 09 '21 at 17:13
  • So should I always open files 'rb', check the BOM, and then open with the correct utf encoding? If so, gak... Would there be some way (when creating files by hand) to ensure that they are utf-8? – Al G Sep 10 '21 at 01:59
  • @AlG Most if not all text editors let you choose the encoding. If creating on Python, specify the encoding when opening the file, e.g. `open(file,'w',encoding='utf8')`. On Windows, `utf-8-sig` might be preferable as that writes a BOM character encoded in UTF-8 as a signature at the start of the file, and Windows apps like Excel will recognize this and assume UTF-8 instead of the ANSI default encoding (usually `cp1252`, but varies with localization). – Mark Tolonen Sep 21 '21 at 17:40
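
Pulling these comments together, a rough sketch of the BOM check discussed above (companyfile as in the question); the helper name is made up, and the no-BOM fallback is just a guess:

import codecs

def sniff_encoding(path, default='utf-8'):
    """Guess a UTF encoding from the file's BOM, falling back to `default`."""
    with open(path, 'rb') as f:
        head = f.read(4)
    # Check UTF-32 first: the UTF-32-LE BOM begins with the same bytes as the UTF-16-LE BOM
    if head.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return 'utf-32'     # the utf-32 codec strips the BOM when reading
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return 'utf-16'     # likewise for utf-16
    if head.startswith(codecs.BOM_UTF8):
        return 'utf-8-sig'  # strips the UTF-8 signature when reading
    return default          # no BOM: could be UTF-8, cp1252, anything

with open(companyfile, encoding=sniff_encoding(companyfile)) as companies:
    for company in companies:
        ...

And to create a UTF-8 file from Python that Windows apps will recognize as UTF-8, write it with the signature Mark Tolonen mentions:

with open('companies.txt', 'w', encoding='utf-8-sig') as out:
    out.write('Fifty_50\n')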