I created a file with just an em dash in it in Notepad and saved this file with Unicode (big endian) encoding. In Notepad, this displays an em dash. When I open the file and read it like this in Python 3/IDLE:

open(file_path, encoding="UTF-16-BE").read()

I get this:

'\ufeff—'

Expressed as bytes, the file's contents are this:

b'\xfe\xff \x14'

Shouldn't it be handling the BOM and not displaying it? I looked at the available encodings for Python and there was nothing like a UTF_16_BE_SIG in there as there is for UTF_8_SIG. What is going on here and how do I handle it properly?

martineau
Melab

1 Answer

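When you specify the endianness explicitly (UTF-16-BE or UTF-16-LE), the codec does not check for a BOM, so a BOM at the start of the file is decoded as an ordinary U+FEFF character. If you want the codec to examine and remove the BOM, specify the plain UTF-16 codec, which reads the BOM to determine the byte order and then strips it: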

open(file_path, encoding="UTF-16").read()
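
To make the difference concrete, here is a minimal sketch that decodes the same bytes as in the question with both codecs. The variable names are only illustrative, and the last step (dropping a leading U+FEFF by hand) is one possible workaround if you have to keep the explicit codec, not something the codec does for you:

```python
# Bytes from the question: a big-endian BOM (FE FF) followed by U+2014.
data = b"\xfe\xff\x20\x14"

# Explicit endianness: the BOM is kept and decoded as U+FEFF.
explicit = data.decode("utf-16-be")

# Plain UTF-16: the BOM selects the byte order and is removed.
detected = data.decode("utf-16")

# Workaround if the explicit codec must be kept: strip a leading U+FEFF manually.
stripped = explicit[1:] if explicit.startswith("\ufeff") else explicit

print(repr(explicit))
print(repr(detected))
print(repr(stripped))
```

Note that plain UTF-16 falls back to the platform's native byte order when no BOM is present, so it is only a safe choice for files that are known to start with one.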
Robᵩ