C library to detect if a file is UTF-8 or UTF-16

Question

Is there a library that can be used to check whether a file is UTF-8 or UTF-16? I found http://utfcpp.sourceforge.net/, but it is in C++ and, for a variety of reasons, I am not allowed to use C++ in the software that I am working on. Thanks for any input.

You want a library which... is not C++. *What language is it allowed to be*? — jalf, Jul 29 '14 at 20:15
possible duplicate of [How to detect UTF-8 in plain C?](http://stackoverflow.com/questions/1031645/how-to-detect-utf-8-in-plain-c) — n0p, Jul 30 '14 at 14:18
Are you assuming it is Unicode? If so, why don't you also have the encoding as metadata/context? If not, there are many other possibilities. Every file is valid CP437. A BOM (UTF-32LE, UTF-32BE, UTF-16LE, UTF-16BE, UTF-8 or UTF-7) is valid Windows-1252. — Tom Blodget, Jul 31 '14 at 05:06

Norman Gray · Answer 1 · 2014-07-30T18:59:12.147

You don't need a library; you should be able to make a guess from the first couple of bytes of the file.

If there's a BOM (the codepoint U+FEFF) at the beginning of the file, then you can use it to sniff the encoding as follows.

00 00 FE FF -> UTF-32, big-endian
FF FE 00 00 -> UTF-32, little-endian
FE FF -> UTF-16, big-endian
FF FE -> UTF-16, little-endian
EF BB BF -> UTF-8

from the Unicode FAQ.
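
The BOM table can be checked with a few byte comparisons. The following is only a sketch: the names `bom_kind` and `sniff_bom` are made up for illustration and do not come from any library. It assumes you have already read the first few bytes of the file into a buffer.

```c
#include <stddef.h>

/* Hypothetical names for illustration; not part of any library. */
typedef enum {
    BOM_NONE,
    BOM_UTF8,
    BOM_UTF16_BE,
    BOM_UTF16_LE,
    BOM_UTF32_BE,
    BOM_UTF32_LE
} bom_kind;

/* Report which BOM, if any, the buffer starts with. */
static bom_kind sniff_bom(const unsigned char *buf, size_t len)
{
    /* The UTF-32 checks come first: FF FE 00 00 also begins with
     * the UTF-16 little-endian BOM FF FE. */
    if (len >= 4 && buf[0] == 0x00 && buf[1] == 0x00 &&
        buf[2] == 0xFE && buf[3] == 0xFF)
        return BOM_UTF32_BE;
    if (len >= 4 && buf[0] == 0xFF && buf[1] == 0xFE &&
        buf[2] == 0x00 && buf[3] == 0x00)
        return BOM_UTF32_LE;
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return BOM_UTF8;
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return BOM_UTF16_BE;
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return BOM_UTF16_LE;
    return BOM_NONE;
}
```

One ambiguity remains: FF FE 00 00 could in principle be a UTF-16 little-endian BOM followed by U+0000. The sketch treats it as UTF-32, which is usually the more plausible reading.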

If you know, or can reasonably assume, that the file starts off with ASCII, then you can distinguish UTF-8 from UTF-16 by looking at the first couple of bytes. If the file starts off with <?xml... (for example!), then:

00 00 00 3C -> UTF-32, big endian
3C 00 00 00 -> UTF-32, little endian
00 3C 00 3F -> UTF-16, big endian
3C 00 3F 00 -> UTF-16, little endian
3C 3F 78 6D -> UTF-8

If you don't know the text at the beginning, but do know it's ASCII, then the pattern of null bytes will be the same.
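
A rough sketch of that null-byte test follows. The names `enc_guess`, `ascii_nonnul` and `guess_from_ascii_start` are invented for this example. It assumes there is no BOM and that the file's first two characters are non-NUL ASCII; if that assumption fails, it reports "unknown" rather than guessing.

```c
#include <stddef.h>

/* Hypothetical names for illustration; not part of any library. */
typedef enum {
    GUESS_UNKNOWN,
    GUESS_UTF8,
    GUESS_UTF16_BE,
    GUESS_UTF16_LE,
    GUESS_UTF32_BE,
    GUESS_UTF32_LE
} enc_guess;

/* True for a non-NUL ASCII byte (0x01..0x7F). */
static int ascii_nonnul(unsigned char c)
{
    return c != 0 && c < 0x80;
}

/* Guess the encoding from the pattern of zero bytes in the first
 * four bytes, assuming the text starts with ASCII and has no BOM. */
static enc_guess guess_from_ascii_start(const unsigned char *b, size_t len)
{
    if (len < 4)
        return GUESS_UNKNOWN;
    if (b[0] == 0 && b[1] == 0 && b[2] == 0 && ascii_nonnul(b[3]))
        return GUESS_UTF32_BE;   /* 00 00 00 xx */
    if (ascii_nonnul(b[0]) && b[1] == 0 && b[2] == 0 && b[3] == 0)
        return GUESS_UTF32_LE;   /* xx 00 00 00 */
    if (b[0] == 0 && ascii_nonnul(b[1]) && b[2] == 0 && ascii_nonnul(b[3]))
        return GUESS_UTF16_BE;   /* 00 xx 00 xx */
    if (ascii_nonnul(b[0]) && b[1] == 0 && ascii_nonnul(b[2]) && b[3] == 0)
        return GUESS_UTF16_LE;   /* xx 00 xx 00 */
    if (ascii_nonnul(b[0]) && ascii_nonnul(b[1]) &&
        ascii_nonnul(b[2]) && ascii_nonnul(b[3]))
        return GUESS_UTF8;       /* xx xx xx xx */
    return GUESS_UNKNOWN;
}
```

Four ASCII bytes in a row are equally consistent with any ASCII-compatible single-byte encoding, so the UTF-8 result here really means "not UTF-16 or UTF-32".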

If the file doesn't reliably start with ASCII, then it starts to get intricate. But...

The best way, though, in terms of generality and reliability, is probably to start parsing the file with a UTF-whatever decoder, and see what works. In fact, since that's surely what you're going to do anyway, you might as well do that, and skip the messy business of sniffing at the file.
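
To make "see what works" concrete for the UTF-8 case, here is a minimal validator sketch. The function name `is_valid_utf8` is my own, and the code is one possible way of checking well-formedness (continuation bytes, overlong forms, surrogates, range), not a reference implementation.

```c
#include <stddef.h>

/* Returns 1 if s[0..len) is well-formed UTF-8, 0 otherwise.
 * Hypothetical helper for illustration; not part of any library. */
static int is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];
        size_t n;          /* number of continuation bytes expected */
        size_t k;
        unsigned long cp;  /* decoded code point */
        unsigned long min; /* smallest code point valid at this length */

        if (c < 0x80) {                     /* single-byte (ASCII) */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {    /* 110xxxxx */
            n = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {    /* 1110xxxx */
            n = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {    /* 11110xxx */
            n = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return 0;   /* stray 10xxxxxx octet or invalid lead byte */
        }

        if (len - i <= n)
            return 0;   /* sequence truncated at end of buffer */

        for (k = 1; k <= n; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;   /* expected a 10xxxxxx continuation byte */
            cp = (cp << 6) | (unsigned long)(s[i + k] & 0x3F);
        }

        if (cp < min)
            return 0;   /* overlong encoding */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;   /* UTF-16 surrogates are not valid in UTF-8 */
        if (cp > 0x10FFFF)
            return 0;   /* beyond the Unicode range */

        i += n + 1;
    }
    return 1;
}
```

Note that NUL bytes are well-formed UTF-8, so UTF-16 text made up entirely of ASCII characters would pass this check. Treating an embedded NUL as a sign of UTF-16 is a heuristic you may want to add on top.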

(This is surely a duplicate, but I can't find a question which completely matches it)

Edited to note that files don't necessarily start with BOMs, but that it's still possible to sniff in some circumstances.

The byte order mark is optional. You can't rely on it being present — jalf, Jul 29 '14 at 20:15
True, true: downvote deserved. I've edited the answer to generalise it. — Norman Gray, Jul 30 '14 at 18:59
Technically, the file may start with non-ASCII beginning, which in some cases means differentiating between UTF-16LE and BE can be tricky. Fortunately, UTF-32 and UTF-8 are easy to detect. — Karol S, Jul 30 '14 at 20:27
Indeed, so almost the only way of detecting the encoding, in the fully general case, is by spotting things which would be an error in one or other encoding, for example a UTF-16 surrogate (which would also indicate the endianness), or a UTF-8 `10xxxxxx` octet in the wrong place. Hence the most pragmatic advice may be, as suggested, to just try decoding and see what happens. — Norman Gray, Jul 31 '14 at 13:05
