
Is there a way to recognize whether a text file is UTF-8 in Python?

I really just want to know whether the file is UTF-8 or not. I don't need to detect other encodings.

Riki137
    duplicates? http://stackoverflow.com/questions/436220/python-is-there-a-way-to-determine-the-encoding-of-text-file and http://stackoverflow.com/questions/2144815/how-to-know-the-encoding-of-a-file-in-python – CppLearner Apr 14 '12 at 18:21
  • I was asking how to detect UTF-8 (true/false), not every encoding. – Riki137 Apr 14 '12 at 18:27
  • You can guess with a high confidence rate, but unless you know more about the content of the file you can't be really certain. For example, the type of file helps (which in this case you say is a text file). Most of the time you can guess. I've come across this a few times in the last year, that's why :) – CppLearner Apr 14 '12 at 18:30
  • @Riki137 I added some information on detecting UTF-8 if you know the alternatives are single-byte encodings. – agf Apr 15 '12 at 19:14

2 Answers


You mentioned in a comment that you only need to detect UTF-8. If you know the alternatives are all single-byte encodings, there is a solution that often works.

If you know it's either UTF-8 or a single-byte encoding like Latin-1, then try opening it first as UTF-8 and then in the other encoding. If the file contains only ASCII characters, it will end up opened as UTF-8 even if it was intended as the other encoding. If it contains any non-ASCII characters, this will almost always pick the right character set of the two.

try:
    # or codecs.open on Python <= 2.5
    # or io.open on Python > 2.5 and <= 2.7
    with open(filename, encoding='UTF-8') as f:
        filedata = f.read()
except UnicodeDecodeError:
    # Fall back to whatever single-byte encoding you expect instead.
    with open(filename, encoding='other-single-byte-encoding') as f:
        filedata = f.read()

Your best bet is to use the chardet package from PyPI, either directly or through UnicodeDammit from BeautifulSoup:

chardet 1.0.1

Universal encoding detector

Detects:

  • ASCII, UTF-8, UTF-16 (2 variants), UTF-32 (4 variants)
  • Big5, GB2312, EUC-TW, HZ-GB-2312, ISO-2022-CN (Traditional and Simplified Chinese)
  • EUC-JP, SHIFT_JIS, ISO-2022-JP (Japanese)
  • EUC-KR, ISO-2022-KR (Korean)
  • KOI8-R, MacCyrillic, IBM855, IBM866, ISO-8859-5, windows-1251 (Cyrillic)
  • ISO-8859-2, windows-1250 (Hungarian)
  • ISO-8859-5, windows-1251 (Bulgarian)
  • windows-1252 (English)
  • ISO-8859-7, windows-1253 (Greek)
  • ISO-8859-8, windows-1255 (Visual and Logical Hebrew)
  • TIS-620 (Thai)

Requires Python 2.1 or later

However, some files will be valid in multiple encodings, so chardet is not a panacea.
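
To make that concrete, here is a minimal sketch of calling chardet directly (not part of the quoted docs); it assumes the chardet package is installed and reuses the filename variable from the snippet above. chardet.detect() takes raw bytes and returns a dictionary with an encoding guess and a confidence score.

    import chardet

    # chardet works on raw bytes, so open the file in binary mode.
    with open(filename, 'rb') as f:
        raw = f.read()

    result = chardet.detect(raw)  # e.g. {'encoding': 'utf-8', 'confidence': 0.99}
    encoding = (result['encoding'] or '').lower()
    # Plain ASCII files are also valid UTF-8, so accept both guesses.
    is_utf8 = encoding in ('utf-8', 'ascii')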

agf

Reliably? No.

In general, a byte sequence has no meaning unless you know how to interpret it -- this goes for text files, but also integers, floating point numbers, etc.

But, there are ways of guessing the encoding of a file, by looking at the byte order mark (if there is one) and the first chunk of the file (to see which encoding yields the most sensible characters). The chardet library is pretty good at this, but be aware it's only a heuristic, albeit a rather powerful one.
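
For illustration, a minimal sketch of that guess-and-check approach might look like the following; is_probably_utf8 is a hypothetical helper name, not a library function, and reading only a fixed-size chunk can misfire if it cuts a multi-byte sequence at the boundary.

    def is_probably_utf8(path, sample_size=64 * 1024):
        # Read a leading chunk of raw bytes from the file.
        with open(path, 'rb') as f:
            sample = f.read(sample_size)
        # A UTF-8 byte order mark is a strong hint.
        if sample.startswith(b'\xef\xbb\xbf'):
            return True
        # Otherwise, check whether the chunk decodes cleanly as UTF-8.
        # (A chunk boundary can split a multi-byte sequence, so a valid
        # file can occasionally be rejected at the cutoff.)
        try:
            sample.decode('utf-8')
            return True
        except UnicodeDecodeError:
            return False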

Cameron