-1

Is there a library that can be used to check whether a file is UTF-8 or UTF-16? I found http://utfcpp.sourceforge.net/ but it is in C++ and, for a variety of reasons, I am not allowed to use C++ in the software that I am working on. Thanks for any input.

doon
  • 1
    You want a library which... is not C++. *What language is it allowed to be*? – jalf Jul 29 '14 at 20:15
  • possible duplicate of [How to detect UTF-8 in plain C?](http://stackoverflow.com/questions/1031645/how-to-detect-utf-8-in-plain-c) – n0p Jul 30 '14 at 14:18
  • Are you assuming it is Unicode? If so, why don't you also have the encoding as metadata/context? If not, there are many other possibilities. Every file is valid CP437. A BOM (UTF-32LE, UTF-32BE, UTF-16LE, UTF-16BE, UTF-8 or UTF-7) is valid Windows-1252. – Tom Blodget Jul 31 '14 at 05:06
  • How come you don't know the encoding of your data files? – jalf Jul 31 '14 at 07:43

1 Answer

0

You don't need a library; you should be able to make a guess from the first couple of bytes of the file.

If there's a BOM (the code point U+FEFF) at the beginning of the file, then you can use it to sniff the encoding as follows. Test the four-byte patterns before the two-byte ones, because FF FE 00 00 also begins with FF FE.

  • 00 00 FE FF -> UTF-32, big-endian
  • FF FE 00 00 -> UTF-32, little-endian
  • FE FF -> UTF-16, big-endian
  • FF FE -> UTF-16, little-endian
  • EF BB BF -> UTF-8

(This table is from the Unicode FAQ.)
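The table above can be sketched in plain C roughly as follows. This is a minimal sketch, not a definitive implementation: `sniff_bom` and the strings it returns are hypothetical names chosen for illustration, and the caller is assumed to have already read the first few bytes of the file into a buffer.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns a static string naming the encoding implied
   by a BOM at the start of buf, or NULL if no BOM is recognised. */
static const char *sniff_bom(const unsigned char *buf, size_t len)
{
    static const unsigned char utf32be[] = { 0x00, 0x00, 0xFE, 0xFF };
    static const unsigned char utf32le[] = { 0xFF, 0xFE, 0x00, 0x00 };
    static const unsigned char utf8[]    = { 0xEF, 0xBB, 0xBF };
    static const unsigned char utf16be[] = { 0xFE, 0xFF };
    static const unsigned char utf16le[] = { 0xFF, 0xFE };

    /* The 4-byte UTF-32 patterns are tested first, because FF FE 00 00
       also begins with the UTF-16LE BOM FF FE. */
    if (len >= 4 && memcmp(buf, utf32be, 4) == 0) return "UTF-32BE";
    if (len >= 4 && memcmp(buf, utf32le, 4) == 0) return "UTF-32LE";
    if (len >= 3 && memcmp(buf, utf8, 3) == 0)    return "UTF-8";
    if (len >= 2 && memcmp(buf, utf16be, 2) == 0) return "UTF-16BE";
    if (len >= 2 && memcmp(buf, utf16le, 2) == 0) return "UTF-16LE";
    return NULL; /* no BOM: the file may still be in any of these encodings */
}
```

One ambiguity remains even here: FF FE 00 00 could also be a UTF-16LE BOM followed by a U+0000 character, so the UTF-32LE answer is a guess, although usually a safe one for text files.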

If you know, or can reasonably assume, that the file starts off with ASCII, then you can distinguish UTF-8 from UTF-16 by looking at the first couple of bytes. If the file starts off with <?xml... (for example!), then:

  • 00 00 00 3C -> UTF-32, big-endian
  • 3C 00 00 00 -> UTF-32, little-endian
  • 00 3C 00 3F -> UTF-16, big-endian
  • 3C 00 3F 00 -> UTF-16, little-endian
  • 3C 3F 78 6D -> UTF-8

If you don't know the text at the beginning, but do know it's ASCII, then the pattern of null bytes will be the same.
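As a rough illustration of the null-byte pattern, the following C sketch classifies the first four bytes of a buffer. It assumes the file begins with at least two non-NUL ASCII characters and has no BOM. The enum and the function name `guess_from_ascii_start` are hypothetical, introduced only for this example.

```c
#include <stddef.h>

/* Hypothetical result type for this sketch. */
enum guess {
    GUESS_UNKNOWN, GUESS_UTF8,
    GUESS_UTF16BE, GUESS_UTF16LE,
    GUESS_UTF32BE, GUESS_UTF32LE
};

/* Guess the encoding from where the zero bytes fall in the first four
   bytes, assuming the text starts with non-NUL ASCII characters. */
static enum guess guess_from_ascii_start(const unsigned char *b, size_t len)
{
    int z[4], a[4]; /* z[i]: byte is 00; a[i]: byte is non-NUL ASCII */
    size_t i;

    if (len < 4) return GUESS_UNKNOWN;
    for (i = 0; i < 4; i++) {
        z[i] = (b[i] == 0x00);
        a[i] = (b[i] != 0x00 && b[i] < 0x80);
    }

    if (z[0] && z[1] && z[2] && a[3]) return GUESS_UTF32BE; /* 00 00 00 xx */
    if (a[0] && z[1] && z[2] && z[3]) return GUESS_UTF32LE; /* xx 00 00 00 */
    if (z[0] && a[1] && z[2] && a[3]) return GUESS_UTF16BE; /* 00 xx 00 xx */
    if (a[0] && z[1] && a[2] && z[3]) return GUESS_UTF16LE; /* xx 00 xx 00 */
    /* Four ASCII bytes in a row: UTF-8, or any other ASCII-compatible
       8-bit encoding, which this check cannot tell apart. */
    if (a[0] && a[1] && a[2] && a[3]) return GUESS_UTF8;
    return GUESS_UNKNOWN;
}
```

Anything that does not match one of the five patterns, including a buffer that starts with a BOM, comes back as `GUESS_UNKNOWN` and needs another strategy.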

If the file doesn't reliably start with ASCII, then it starts to get intricate. But...

The best way, though, in terms of generality and reliability, is probably to start parsing the file with a UTF-whatever decoder, and see what works. In fact, since that's surely what you're going to do anyway, you might as well do that, and skip the messy business of sniffing at the file.
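To make "see what works" concrete for the UTF-8 case, here is a sketch of a strict well-formedness check in C. The function name `is_valid_utf8` is hypothetical, and the whole buffer is assumed to be in memory. It rejects stray continuation bytes, truncated sequences, overlong forms, encoded surrogates, and code points above U+10FFFF.

```c
#include <stddef.h>

/* Returns 1 if s[0..len) is well-formed UTF-8, 0 otherwise. */
static int is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];
        size_t n, k;       /* n: continuation bytes expected after c */
        unsigned long cp;  /* code point being assembled */
        unsigned long min; /* smallest code point allowed at this length */

        if (c < 0x80) { i++; continue; } /* plain ASCII byte */
        else if ((c & 0xE0) == 0xC0) { n = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { n = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { n = 3; cp = c & 0x07; min = 0x10000; }
        else return 0; /* stray 10xxxxxx byte or invalid lead byte */

        if (len - i <= n) return 0; /* sequence runs past the end */

        for (k = 1; k <= n; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return 0; /* not 10xxxxxx */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }

        if (cp < min) return 0;                     /* overlong encoding */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0; /* surrogate range */
        if (cp > 0x10FFFF) return 0;                /* beyond Unicode */

        i += n + 1;
    }
    return 1;
}
```

A caveat on relying on this alone: ASCII-only text stored as UTF-16 also passes, because 00 is a legal UTF-8 byte (it encodes U+0000). If the data is supposed to be text, it may be worth treating embedded NUL bytes as a sign that the file is not UTF-8 after all.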

(This is surely a duplicate, but I can't find a question which completely matches it)

Edited to note that files don't necessarily start with BOMs, but that it's still possible to sniff in some circumstances.

Norman Gray
  • 2
    The byte order mark is optional. You can't rely on it being present – jalf Jul 29 '14 at 20:15
  • True, true: downvote deserved. I've edited the answer to generalise it. – Norman Gray Jul 30 '14 at 18:59
  • Technically, the file may start with non-ASCII text, which in some cases makes differentiating between UTF-16LE and UTF-16BE tricky. Fortunately, UTF-32 and UTF-8 are easy to detect. – Karol S Jul 30 '14 at 20:27
  • Indeed, so almost the only way of detecting the encoding, in the fully general case, is by spotting things which would be an error in one or other encoding, for example a UTF-16 surrogate (which would also indicate the endianness), or a UTF-8 `10xxxxxx` octet in the wrong place. Hence the most pragmatic advice may be, as suggested, to just try decoding and see what happens. – Norman Gray Jul 31 '14 at 13:05