I am trying to encode Bangla words in python using pandas dataframe. But as encoding type, utf-8 is not working but utf-8-sig is. I know utf-8-sig is with BOM(Byte order mark). But why is this called utf-8-sig and how it works?
Asked
Active
Viewed 5.3k times
55
-
2I believe this is a duplicate: https://stackoverflow.com/questions/2223882/whats-the-difference-between-utf-8-and-utf-8-without-bom – juanpa.arrivillaga Jul 22 '19 at 20:08
-
Possible duplicate of [python utf-8-sig BOM in the middle of the file when appending to the end](https://stackoverflow.com/questions/23154355/python-utf-8-sig-bom-in-the-middle-of-the-file-when-appending-to-the-end) – Rory Daulton Jul 22 '19 at 20:10
-
1I don't find anything related to the 'sig' part. That's why I asked this question. – JuBaer AD Jul 25 '19 at 06:32
-
9This is a perfectly valid question. – Nobilis Feb 21 '20 at 10:43
1 Answers
65
"sig" in "utf-8-sig" is the abbreviation of "signature" (i.e. signature utf-8 file).
Using utf-8-sig to read a file will treat the BOM as metadata that explains how to interpret the file, instead of as part of the file contents. Read more on Python's codecs
documentation page"
To increase the reliability with which a UTF-8 encoding can be detected, Microsoft invented a variant of UTF-8 (that Python calls
"utf-8-sig"
) for its Notepad program: Before any of the Unicode characters is written to the file, a UTF-8 encoded BOM (which looks like this as a byte sequence: 0xef, 0xbb, 0xbf) is written.

Nate Anderson
- 18,334
- 18
- 100
- 135

eee
- 781
- 5
- 8