
I was wondering whether the Python standard library has a function that returns a file's character encoding by looking for the presence of a BOM (byte-order mark).

I've already implemented something, but I'm afraid I might be reinventing the wheel.

Update (based on John Machin's correction):

import codecs

def _get_encoding_from_bom(fd):
    # Read enough bytes for the longest BOM (UTF-32 uses 4 bytes),
    # then rewind so the caller can re-read the file from the start.
    first_bytes = fd.read(4)
    fd.seek(0)
    # Longer BOMs must be tested first: the UTF-16 BOMs are prefixes
    # of the UTF-32 BOMs.
    bom_to_encoding = (
        (codecs.BOM_UTF32_LE, 'utf-32'),
        (codecs.BOM_UTF32_BE, 'utf-32'),
        (codecs.BOM_UTF8, 'utf-8-sig'),
        (codecs.BOM_UTF16_LE, 'utf-16'),
        (codecs.BOM_UTF16_BE, 'utf-16'),
    )
    for bom, encoding in bom_to_encoding:
        if first_bytes.startswith(bom):
            return encoding
    return None
Signal15
Ioan Alexandru Cucu
  • I don't know the answer to your question, but if you end up using your code, you should have a default for files without a BOM (and make sure you've read one). – martineau Nov 27 '12 at 12:38
  • 1
    Have you looked at http://pypi.python.org/pypi/chardet - looks like someone's written a library to do this sort of thing (and probably more extensively) – Jon Clements Nov 27 '12 at 12:39
  • I don't think chardet will help you, according to http://ginstrom.com/scribbles/2008/03/08/using-chardet-to-convert-arbitrary-byte-strings-to-unicode/. just note that if you intend to call the above function a lot consider moving the `bom_to_encoding` map outside of the function. – zenpoy Nov 27 '12 at 12:43
  • @martineau I'd rather return None so then I would know that I need to check for other char-encoding rules (such as '@charset "utf-8"' in css files) – Ioan Alexandru Cucu Nov 27 '12 at 12:48
  • 1
    In that case I recommend you add a `return` or `return None` at the end of the function so people don't think it's an oversight. – martineau Nov 27 '12 at 12:53
  • You know that UTF-8 does not require a BOM, right? – Katriel Nov 27 '12 at 12:53
  • @katrielalex Shouldn't require a BOM. I think a UTF-8 BOM does exist so that somebody decoding the file knows to use a UTF-8 decoder rather than an ASCII one. – Ioan Alexandru Cucu Nov 27 '12 at 13:07
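
As a side note on the comment thread above, a small illustration of why the code maps the UTF-8 BOM to 'utf-8-sig' rather than plain 'utf-8': the -sig codec strips the BOM on decode, while plain 'utf-8' keeps it as U+FEFF:

```python
import codecs

raw = codecs.BOM_UTF8 + 'hello'.encode('utf-8')
print(raw.decode('utf-8-sig'))       # hello (BOM stripped)
print(repr(raw.decode('utf-8')))     # '\ufeffhello' (BOM kept as U+FEFF)
```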

1 Answer


Your code has a subtle bug that you may never be bitten by, but it's best that you avoid it.

In your original version you were iterating over a dictionary's keys. Dictionary iteration order is not guaranteed by Python (before 3.7), and in this case the order matters.

codecs.BOM_UTF32_LE is b'\xff\xfe\x00\x00' (four bytes)
codecs.BOM_UTF16_LE is b'\xff\xfe' (its two-byte prefix)

If your file is encoded in UTF-32LE but UTF-16LE just happens to be tested first, you will incorrectly state that the file is encoded in UTF-16LE.

To avoid this, you can iterate over a tuple that is ordered by BOM-length descending. See sample code in my answer to this question.
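
A minimal demonstration of the prefix collision (byte values as shown by Python 3):

```python
import codecs

# A little-endian UTF-32 stream: BOM followed by the character 'A'.
data = codecs.BOM_UTF32_LE + 'A'.encode('utf-32-le')

# Both checks succeed, because the UTF-16LE BOM is a prefix of the
# UTF-32LE BOM -- so the longer BOM must be tested first.
print(data.startswith(codecs.BOM_UTF16_LE))  # True
print(data.startswith(codecs.BOM_UTF32_LE))  # True
```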

John Machin