70

I'm reading a series of source code files using Python and running into a unicode BOM error. Here's my code:

import os
import chardet

bytes = min(32, os.path.getsize(filename))
raw = open(filename, 'rb').read(bytes)
result = chardet.detect(raw)
encoding = result['encoding']

infile = open(filename, mode, encoding=encoding)
data = infile.read()
infile.close()

print(data)

As you can see, I'm detecting the encoding using chardet, then reading the file into memory and attempting to print it. The print statement fails on Unicode files containing a BOM with the error:

UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-2:
character maps to <undefined>

I'm guessing it's trying to encode the BOM for the console using the default character set and failing. How do I remove the BOM from the string to prevent this?

asked by Chris (edited by dda)
  • Just wondering, what does `chardet` return as the encoding when the data starts with a UTF-8 BOM? Seems that would be a pretty big hint that the encoding was UTF-8 :^) – Mark Tolonen Nov 28 '12 at 01:04
  • 2
    @MarkTolonen: [it was a bug](https://github.com/chardet/chardet/pull/8) that is [fixed now](http://stackoverflow.com/a/32774741/4279) – jfs Sep 25 '15 at 04:35

8 Answers

105

There is no reason to check whether a BOM exists: `utf-8-sig` manages that for you and behaves exactly like `utf-8` if the BOM is absent:

# Standard UTF-8 without BOM
>>> b'hello'.decode('utf-8')
'hello'
>>> b'hello'.decode('utf-8-sig')
'hello'

# BOM encoded UTF-8
>>> b'\xef\xbb\xbfhello'.decode('utf-8')
'\ufeffhello'
>>> b'\xef\xbb\xbfhello'.decode('utf-8-sig')
'hello'

In the example above, you can see that `utf-8-sig` correctly decodes the given string regardless of whether the BOM is present. If you think there is even a small chance that a BOM might exist in the files you are reading, just use `utf-8-sig` and don't worry about it.
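
The same applies when reading files; a minimal sketch (assuming `filename` points at a text file that may or may not carry a UTF-8 BOM):

with open(filename, encoding='utf-8-sig') as f:
    text = f.read()  # the BOM, if any, has been stripped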

lightswitch05
  • 2
    I hadn't come across utf-8-sig until now --- thanks! Is there any good reason why stripping the BOM is not the default behavior for the more obvious value of encoding? I mean, does anyone ever actually want to see the BOM in a string read from a text file? – AdamF Feb 01 '19 at 08:05
  • 3
    @AdamF I think the idea between `utf-8`and `utf-8-sig` is to not have unexpected behavior/magic. I'm glad Python `utf-8` decodes the file as-is, the BOM is a character in the file, so it makes sense to preserve it. I'm also very glad for `utf-8-sig` where stripping it is handled automatically. While I don't know of a case where someone would want the BOM, I'm sure use cases exist. With these two encodings, we get to decide our own expected behavior. – lightswitch05 Mar 21 '19 at 14:03
67

BOM characters should be automatically stripped when decoding UTF-16, but not UTF-8, unless you explicitly use the utf-8-sig encoding. You could try something like this:

import io
import os
import codecs
import chardet

bytes = min(32, os.path.getsize(filename))
raw = open(filename, 'rb').read(bytes)

if raw.startswith(codecs.BOM_UTF8):
    encoding = 'utf-8-sig'
else:
    result = chardet.detect(raw)
    encoding = result['encoding']

infile = io.open(filename, mode, encoding=encoding)
data = infile.read()
infile.close()

print(data)
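
As a quick illustration of the first point above, a REPL sketch: the utf-16 codec consumes the BOM, while plain utf-8 leaves it in the string:

>>> import codecs
>>> (codecs.BOM_UTF16_LE + 'hi'.encode('utf-16-le')).decode('utf-16')
'hi'
>>> (codecs.BOM_UTF8 + b'hi').decode('utf-8')
'\ufeffhi'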
Chewie
  • 13
    Funny that `chardet` doesn't automatically do this. – Mark Ransom Nov 27 '12 at 19:25
  • OK, I did this, and it appears to be working, but then Python is throwing a weird error complaining that \u2019 (right single quote) cannot be decoded using utf-8-sig. How exactly do I handle that? – Chris Nov 27 '12 at 19:35
  • Nevermind, I was getting that because the console doesn't support that character. Problem solved. – Chris Nov 27 '12 at 19:42
  • Yeah, that happens. [This article](http://wiki.python.org/moin/PrintFails) comes in handy in such cases. – Chewie Nov 27 '12 at 19:46
  • 1
    +1 on @MarkRansom comment: does anybody have an idea why chardet doesn't do it automagically? – Ronan Jouchet May 24 '13 at 12:56
  • I'd guess because that in essence removes a character from the input stream. A general purpose encoding detection can't know if you'll need the BOM mark. – abesto Feb 08 '14 at 07:06
  • @abesto it doesn't affect the input stream at all - it merely yields the correct detected encoding. What you choose to do with the input stream as a result of that knowledge is completely up to you. – Stephen Fuhry Apr 24 '14 at 18:33
  • 1
    @StephenJ.Fuhry WDYM it doesn't affect the input stream? It's part of the input stream. If the application parsing the input stream as UTF-* doesn't understand the BOM mark, then it yields a strange character. If the application parsing the input stream as UTF-* understands the BOM mark, then what you described happens. In this case the problem is exactly the the application does not understand BOM. So what you described does not happen. Which is exactly the problem. – abesto Apr 25 '14 at 14:48
  • Python 2 does not have an `encoding` parameter in the `open` function – bowman han Jul 30 '14 at 05:52
  • [`chardet`'s bug with `BOM_UTF8` is fixed](http://stackoverflow.com/a/32774741/4279). Though the answer is inconsistent either way: a utf-16le BOM leads to `encoding='UTF-16LE'`, which causes the BOM to be left in the stream (inconsistent with `utf-8-sig`, which strips the BOM from the stream). – jfs Sep 25 '15 at 04:45
  • @MarkTolonen: did you mean to address your comment to me? Here's what's happening: if input starts with `BOM_UTF8` then BOM (the thing at the start of the file) is removed (`utf-8-sig` is used in the answer and chardet 2.3.0+). If input starts with utf-16 BOM (le or be) then BOM is *not* removed (chardet returns `'utf-16le'`, `'utf-16be'` encodings in this case, not `'utf-16'` <- no le or be). To be consistent, the result should be either `('utf-8-sig', 'utf-16', ..)` or `('utf-8', 'utf-16le', 'utf-16be', ..)` but not the mix. – jfs Sep 25 '15 at 15:38
28

I've composed a nifty BOM-based detector based on Chewie's answer.

It autodetects the encoding in the common use case where data can be either in a known local encoding or in Unicode with BOM (that's what text editors typically produce). More importantly, unlike chardet, it doesn't do any random guessing, so it gives predictable results:

import codecs

def detect_by_bom(path, default):
    with open(path, 'rb') as f:
        raw = f.read(4)    # will read less if the file is smaller
    # BOM_UTF32_LE starts with BOM_UTF16_LE, so the former must be tried first
    for enc, boms in (
            ('utf-8-sig', (codecs.BOM_UTF8,)),
            ('utf-32', (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)),
            ('utf-16', (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))):
        if any(raw.startswith(bom) for bom in boms):
            return enc
    return default
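
A possible usage sketch (the 'latin-1' default is just a placeholder for whatever local encoding you expect):

encoding = detect_by_bom(filename, default='latin-1')
with open(filename, encoding=encoding) as f:
    data = f.read()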
ivan_pozdeev
11

chardet detects BOM_UTF8 automatically since version 2.3.0, released on Oct 7, 2014:

#!/usr/bin/env python
import chardet # $ pip install chardet

# detect file encoding
with open(filename, 'rb') as file:
    raw = file.read(32) # at most 32 bytes are returned
    encoding = chardet.detect(raw)['encoding']

with open(filename, encoding=encoding) as file:
    text = file.read()
print(text)

Note: chardet may return 'UTF-16LE'/'UTF-16BE' (and the UTF-32 equivalents) as the encoding, which leaves the BOM in the text. The 'LE'/'BE' suffix should be stripped to avoid that, though at that point it is easier to detect the BOM yourself, e.g., as in @ivan_pozdeev's answer.
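
A sketch of that suffix-stripping workaround (normalize_bom_encoding is a hypothetical helper, not a chardet API, and is only safe when the raw bytes actually start with a BOM):

def normalize_bom_encoding(encoding):
    # 'UTF-16LE' etc. leave the BOM in the decoded text; the plain
    # 'utf-16'/'utf-32' codecs consume it instead
    if encoding and encoding.upper() in ('UTF-16LE', 'UTF-16BE', 'UTF-32LE', 'UTF-32BE'):
        return encoding[:-2]
    return encoding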

To avoid UnicodeEncodeError while printing Unicode text to the Windows console, see Python, Unicode, and the Windows console.

jfs
  • Am I mistaken, or does [`open`](https://docs.python.org/2/library/functions.html#open) actually not have an `encoding` keyword argument? – Yan Foto Dec 16 '15 at 13:59
  • 1
    @YanFoto: it has on Python 3. Use `io.open` on older versions. – jfs Dec 16 '15 at 14:03
9

I find the other answers overly complex. There is a simpler way that doesn't require dropping down into the lower-level idiom of binary file I/O, doesn't rely on a character-set heuristic (chardet) outside the Python standard library, and doesn't need a rarely seen alternate encoding signature (utf-8-sig vs. the common utf-8) that doesn't seem to have an analog in the UTF-16 family.

The simplest approach I've found is dealing with BOM characters in Unicode, and letting the codecs do the heavy lifting. There is only one Unicode byte order mark, so once data is converted to Unicode characters, determining if it's there and/or adding/removing it is easy. To read a file with a possible BOM:

BOM = '\ufeff'
with open(filepath, mode='r', encoding='utf-8') as f:
    text = f.read()
    if text.startswith(BOM):
        text = text[1:]

This works with all the interesting UTF codecs (e.g. utf-8, utf-16le, utf-16be, ...), doesn't require extra modules, and doesn't require dropping down into binary file processing or specific codec constants.

To write a BOM:

text_with_BOM = text if text.startswith(BOM) else BOM + text
with open(filepath, mode='w', encoding='utf-16be') as f:
    f.write(text_with_BOM)

This works with any encoding. UTF-16 big endian is just an example.
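
A quick round-trip sketch combining the two snippets (reusing BOM and filepath from above):

with open(filepath, mode='w', encoding='utf-16be') as f:
    f.write(BOM + 'hello')
with open(filepath, mode='r', encoding='utf-16be') as f:
    text = f.read()
if text.startswith(BOM):
    text = text[1:]
assert text == 'hello'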

This is not, btw, to dismiss chardet. It can help when you have no information about what encoding a file uses. It's just not needed for adding/removing BOMs.

Jonathan Eunice
  • This doesn't work for me when the text file uses utf-16 LE. When reading the file, I get "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte". – criddell Jul 31 '19 at 16:34
  • @criddell Did you use the code above explicitly? If so, you may've tried to read a `utf-16be`-encoded file with a `utf-8` codec. The above example presents both encodings being used to show breadth. In practice, if you write a file with a `utf-16be` encoding, you must also read that file with the same encoding. Technique is tested and does work. [Example here](https://gist.github.com/jonathaneunice/9f41ae6a01654e8bec35ab72bc1b03dd) – Jonathan Eunice Jul 31 '19 at 18:04
  • I see what you are saying now. Your first block of code is what misled me. In it you are reading the file as utf-8 and then looking for a utf-16 BOM at the start. Your example code you linked to is different. For my project, I don't know the encoding beforehand, and so your answer isn't the one for me. – criddell Aug 01 '19 at 19:17
  • The solution to BOM management is independent of not knowing what encoding you're reading (gnarly problem in and of itself). They can be combined, however. See e.g. [this function](https://gist.github.com/jonathaneunice/6c1337876fe7eb74a4ba48b57a4869a4) – Jonathan Eunice Aug 01 '19 at 20:06
  • Simplest, hence the best, solution on this page. – Sanctus Jan 21 '20 at 17:39
  • This solution duplicates by hand what `utf-8-sig` is doing. It's equivalent to `open(<...>,encoding='utf-8-sig')`. – ivan_pozdeev Feb 10 '20 at 16:12
  • @ivan_pozdeev Sure...when the base encoding is `utf-8`. But AFAIK, none of the UTF-16 or UTF-32 encodings have a similar `-sig` variant. This technique works uniformly across all encodings, rather than for just one. – Jonathan Eunice Feb 10 '20 at 20:24
  • `utf-16` and `utf-32` are the "sig" variants. – ivan_pozdeev Feb 10 '20 at 22:46
2

In case you want to edit the file, you will want to know which BOM was used. This version of @ivan_pozdeev's answer returns both the encoding and the optional BOM:

import codecs
from typing import Optional, Tuple

def encoding_by_bom(path, default='utf-8') -> Tuple[str, Optional[bytes]]:
    """Adapted from https://stackoverflow.com/questions/13590749/reading-unicode-file-data-with-bom-chars-in-python/24370596#24370596 """
    with open(path, 'rb') as f:
        raw = f.read(4)    # will read less if the file is smaller
    # BOM_UTF32_LE starts with BOM_UTF16_LE, so the former must be tried first
    for enc, boms in (
            ('utf-8-sig', (codecs.BOM_UTF8,)),
            ('utf-32', (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)),
            ('utf-16', (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))):
        for bom in boms:
            if raw.startswith(bom):
                return enc, bom
    return default, None
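
A possible usage sketch (error handling omitted):

encoding, bom = encoding_by_bom(path)
with open(path, encoding=encoding) as f:
    text = f.read()
# bom is e.g. codecs.BOM_UTF16_LE, or None if no BOM was found;
# keep it if you need to restore the exact BOM when writing the file back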

ikamen
0

A variant of @ivan_pozdeev's answer for strings/exceptions (rather than files). I'm dealing with Unicode HTML content that was stuffed into a Python exception (see http://bugs.python.org/issue2517):

import codecs

def detect_encoding(bytes_str):
  for enc, boms in \
      ('utf-8-sig', (codecs.BOM_UTF8,)), \
      ('utf-16', (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)), \
      ('utf-32', (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
    if any(bytes_str.startswith(bom) for bom in boms):
      return enc
  return 'utf-8' # default

def safe_exc_to_str(exc):
  try:
    return str(exc)
  except UnicodeEncodeError:
    return unicode(exc).encode(detect_encoding(exc.content))

Alternatively, this much simpler code is able to delete non-ascii characters without much fuss:

def just_ascii(s):
  return unicode(s).encode('ascii', 'ignore')
Dave Dopson
  • 1
    You shouldn't see BOM inside a bytestring in memory (it should be stripped in the code that decodes a file). Your default (utf-8) may raise exception during decoding. BOM does not guarantee the encoding will be successful. Use `errors='backslashreplace'` instead. Unrelated: (1) don't use bare `except:` it catches too much, even `KeyboardInterrupt`. (2) don't use `\` and bracket instead `for enc, boms in [...]:` – jfs Sep 25 '15 at 04:30
  • @J.F.Sebastian - I switched to "except Exception". I'm not sure I understand your #2 feedback. FWIW, I'm seeing BOM characters from HTML that came over the wire and was subsequently stuffed into a python Exception. – Dave Dopson Sep 29 '15 at 02:28
  • (1) if you expect a character soup on input then it is even more likely that `encode()` might raise an exception inside the exception handler (2) don't lose the exception info: use the `'backslashreplace'` error handler instead of `'ignore'` (3) I meant that you could use `[]` brackets instead of backslashes to break the expression in the `for`-loop into multiple lines. – jfs Sep 29 '15 at 02:44
0

I prefer this solution when dealing with a BOM marker:

with open(filename, "r", encoding='utf-8-sig') as f:
    text = f.read()

Documentation on the utf-8-sig codec: https://docs.python.org/3/library/codecs.html

Windel