55

I am trying to encode Bangla words in Python using a pandas DataFrame. As the encoding, utf-8 does not work but utf-8-sig does. I know utf-8-sig is UTF-8 with a BOM (byte order mark). But why is it called utf-8-sig, and how does it work?

JuBaer AD
  • 2
    I believe this is a duplicate: https://stackoverflow.com/questions/2223882/whats-the-difference-between-utf-8-and-utf-8-without-bom – juanpa.arrivillaga Jul 22 '19 at 20:08
  • Possible duplicate of [python utf-8-sig BOM in the middle of the file when appending to the end](https://stackoverflow.com/questions/23154355/python-utf-8-sig-bom-in-the-middle-of-the-file-when-appending-to-the-end) – Rory Daulton Jul 22 '19 at 20:10
  • 1
    I don't find anything related to the 'sig' part. That's why I asked this question. – JuBaer AD Jul 25 '19 at 06:32
  • 9
    This is a perfectly valid question. – Nobilis Feb 21 '20 at 10:43

1 Answer

65

"sig" in "utf-8-sig" is short for "signature" (i.e. a UTF-8 file with a signature). Using utf-8-sig to read a file treats the BOM as metadata that explains how to interpret the file, instead of as part of the file contents. Read more on Python's codecs documentation page.
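A minimal sketch of the reading side, using only the standard library. The sample word and the variable names are just for illustration; the point is that decoding BOM-prefixed bytes with utf-8 keeps the BOM as a character, while utf-8-sig drops it.

```python
import codecs

# Sample Bangla word, chosen only for illustration.
text = "বাংলা"

# Simulate the bytes of a file that starts with a UTF-8 BOM.
raw = codecs.BOM_UTF8 + text.encode("utf-8")

# Plain utf-8 keeps the BOM as the character U+FEFF at the start.
as_utf8 = raw.decode("utf-8")

# utf-8-sig treats a leading BOM as a signature and strips it.
as_sig = raw.decode("utf-8-sig")

# utf-8-sig also accepts data that has no BOM at all.
no_bom = text.encode("utf-8").decode("utf-8-sig")

print(repr(as_utf8[0]))  # '\ufeff'
print(as_sig == text)    # True
print(no_bom == text)    # True
```

So when reading, utf-8-sig is the more forgiving choice: it works whether or not the signature is present.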

To increase the reliability with which a UTF-8 encoding can be detected, Microsoft invented a variant of UTF-8 (that Python calls "utf-8-sig") for its Notepad program: Before any of the Unicode characters is written to the file, a UTF-8 encoded BOM (which looks like this as a byte sequence: 0xef, 0xbb, 0xbf) is written.
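The writing side can be checked the same way. This is a small sketch, again with an illustrative sample word: encoding with utf-8-sig should produce the three BOM bytes followed by the ordinary UTF-8 bytes.

```python
import codecs

# Sample Bangla word, chosen only for illustration.
word = "বাংলা"

plain = word.encode("utf-8")       # no signature
signed = word.encode("utf-8-sig")  # signature + same payload

# The signature is exactly the byte sequence 0xef, 0xbb, 0xbf.
print(signed[:3] == codecs.BOM_UTF8)  # True
print(signed[3:] == plain)            # True
print(len(signed) - len(plain))       # 3
```

This may also explain the behaviour in the question, though that is an assumption about the asker's setup: some programs, Excel being a common example, rely on the BOM to recognise a CSV file as UTF-8, so a file written with `df.to_csv(path, encoding="utf-8-sig")` shows Bangla text correctly where a plain utf-8 file does not.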

Nate Anderson
eee