
Here is my situation: I need to correctly determine which character encoding is used for a given text file. Hopefully, it can correctly return one of the following types:

enum CHARACTER_ENCODING
{
    ANSI,
    Unicode,
    Unicode_big_endian,
    UTF8_with_BOM,
    UTF8_without_BOM
};

Up to now, I can correctly tell whether a text file is Unicode, Unicode big endian, or UTF-8 with BOM by calling the following function. It can also correctly report ANSI, as long as the given text file is not actually UTF-8 without BOM. The problem is that when the text file is UTF-8 without BOM, the function mistakenly regards it as an ANSI file.

#include <windows.h>
#include <stdexcept>
using std::runtime_error;

CHARACTER_ENCODING get_text_file_encoding(const char *filename)
{
    CHARACTER_ENCODING encoding;

    unsigned char uniTxt[] = {0xFF, 0xFE};        // Unicode (UTF-16 LE) BOM
    unsigned char endianTxt[] = {0xFE, 0xFF};     // Unicode big endian (UTF-16 BE) BOM
    unsigned char utf8Txt[] = {0xEF, 0xBB, 0xBF}; // UTF-8 BOM (all three bytes)

    DWORD dwBytesRead = 0;
    // CreateFileA: the filename is a narrow string, so use the ANSI version explicitly
    HANDLE hFile = CreateFileA(filename, GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (hFile == INVALID_HANDLE_VALUE)
        throw runtime_error("cannot open file");

    BYTE lpHeader[3] = {0};
    if (!ReadFile(hFile, lpHeader, 3, &dwBytesRead, NULL))
    {
        CloseHandle(hFile);
        throw runtime_error("cannot read file");
    }
    CloseHandle(hFile);

    if (dwBytesRead >= 2 && lpHeader[0] == uniTxt[0] && lpHeader[1] == uniTxt[1])            // Unicode file
        encoding = CHARACTER_ENCODING::Unicode;
    else if (dwBytesRead >= 2 && lpHeader[0] == endianTxt[0] && lpHeader[1] == endianTxt[1]) // Unicode big endian file
        encoding = CHARACTER_ENCODING::Unicode_big_endian;
    else if (dwBytesRead >= 3 && lpHeader[0] == utf8Txt[0] && lpHeader[1] == utf8Txt[1] && lpHeader[2] == utf8Txt[2]) // UTF-8 file
        encoding = CHARACTER_ENCODING::UTF8_with_BOM;
    else
        encoding = CHARACTER_ENCODING::ANSI;      // fallback; also what a BOM-less UTF-8 file ends up as

    return encoding;
}

This problem has been blocking me for a long time and I still cannot find a good solution. Any hint will be appreciated.

herohuyongtao
    The term "ANSI" is often incorrectly used to refer to an 8-bit encoding, typically one of the Windows-specific ones such as Windows-1252, which never became an ANSI standard. The term "Unicode" is often, in the Microsoft world, incorrectly used to refer to the UTF-16 encoding; Unicode is not an encoding, but there are several encodings that can be used to represent Unicode. An ASCII file is indistinguishable from a UTF-8 file that doesn't happen to contain any characters outside the range 0..127. Most UTF-8 files do not start with a BOM (since UTF-8 has no byte order). – Keith Thompson Dec 23 '13 at 16:39
  • Instead of enumerating the encoding types in a comment, enumerate them in an `enum`. – Casey Dec 23 '13 at 20:33

1 Answer


For starters, there's no such physical encoding as "Unicode". What you probably mean by this is UTF-16. Secondly, any file is valid in "ANSI", or any other single-byte encoding for that matter. The only thing you can do is guess, checking in the order that is most likely to rule out invalid matches first.

You should check, in this order:

  • Is there a UTF-16 BOM at the beginning? Then it's probably UTF-16. Use the BOM as an indicator of whether it's big endian or little endian, then check whether the rest of the file conforms.
  • Is there a UTF-8 BOM at the beginning? Then it's probably UTF-8. Check the rest of the file.
  • If the above didn't result in a positive match, check if the entire file is valid UTF-8 (see the sketch after this list). If it is, it's probably UTF-8.
  • If the above didn't result in a positive match, it's probably ANSI.
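
Here's a minimal sketch of that order in C++, reusing the question's `CHARACTER_ENCODING` enum. It's an illustration under assumptions, not the one true implementation: the helper names `is_valid_utf8` and `guess_encoding` are made up, the whole file is assumed to already be in memory, and the validator only checks lead/continuation byte structure (a strict validator would also reject overlong forms and encoded surrogates):

#include <cstddef>

// Hypothetical helper: structural UTF-8 validation only. It accepts some
// sequences (overlong forms, encoded surrogates) that a strict validator
// would reject.
bool is_valid_utf8(const unsigned char *data, std::size_t size)
{
    std::size_t i = 0;
    while (i < size)
    {
        unsigned char c = data[i];
        std::size_t extra;                         // continuation bytes expected
        if (c <= 0x7F)               extra = 0;    // 0xxxxxxx: ASCII
        else if ((c & 0xE0) == 0xC0) extra = 1;    // 110xxxxx
        else if ((c & 0xF0) == 0xE0) extra = 2;    // 1110xxxx
        else if ((c & 0xF8) == 0xF0) extra = 3;    // 11110xxx
        else return false;                         // invalid lead byte
        if (i + extra >= size)
            return false;                          // sequence runs past the end
        for (std::size_t j = 1; j <= extra; ++j)
            if ((data[i + j] & 0xC0) != 0x80)      // must be 10xxxxxx
                return false;
        i += extra + 1;
    }
    return true;
}

// Hypothetical helper: apply the checks in the order described above.
CHARACTER_ENCODING guess_encoding(const unsigned char *data, std::size_t size)
{
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return CHARACTER_ENCODING::Unicode;             // UTF-16 LE BOM
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return CHARACTER_ENCODING::Unicode_big_endian;  // UTF-16 BE BOM
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return CHARACTER_ENCODING::UTF8_with_BOM;       // full 3-byte UTF-8 BOM
    if (is_valid_utf8(data, size))
        return CHARACTER_ENCODING::UTF8_without_BOM;    // whole file scans as UTF-8
    return CHARACTER_ENCODING::ANSI;                    // nothing matched
}

Note that a pure-ASCII file comes back as UTF-8 without BOM here; that's technically correct (ASCII is a subset of UTF-8, as the comments under the question point out), but you may want to treat it specially.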

If you expect UTF-16 files without a BOM as well (it's possible, for example, for XML files which specify the encoding in the XML declaration), then you have to shove that rule in there too; one possible heuristic is sketched below. Any of the above may produce a false positive, falsely identifying an ANSI file as UTF-* (though it's unlikely). You should always have metadata that tells you what encoding a file is in; detecting it after the fact is not possible with 100% accuracy.
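
For the BOM-less UTF-16 case, one common heuristic (my sketch, not part of the answer above) exploits the fact that mostly-ASCII text encoded as UTF-16 LE has a NUL high byte at every odd offset (at even offsets for UTF-16 BE). The name `looks_like_utf16le` and the 90% threshold are arbitrary choices for illustration:

#include <cstddef>

// Hypothetical heuristic: mostly-ASCII UTF-16 LE text has 0x00 at odd
// offsets. Binary data that happens to be full of NULs can still fool it.
bool looks_like_utf16le(const unsigned char *data, std::size_t size)
{
    if (size < 2 || size % 2 != 0)
        return false;                  // UTF-16 needs an even byte count
    std::size_t odd_nuls = 0;
    for (std::size_t i = 1; i < size; i += 2)
        if (data[i] == 0x00)
            ++odd_nuls;
    // Arbitrary threshold: guess UTF-16 LE if at least 90% of the high
    // bytes are NUL.
    return odd_nuls * 10 >= (size / 2) * 9;
}

Counting even offsets instead gives the big-endian variant. For non-Latin text the NUL ratio drops sharply, so a more serious detector would also try decoding the bytes as UTF-16 and check for unpaired surrogates.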

deceze
  • I just noticed, in Notepad++, there is no `UTF-16`. Instead, it has two more types: `UCS-2 Big Endian` and `UCS-2 Little Endian`. So is `UTF-16` equivalent to `UCS-2` here? – herohuyongtao Dec 23 '13 at 16:18
  • Nope, UCS-2 is an older Unicode encoding which is rarely used anymore. UTF-16 is UTF-16, but typically just mislabeled as "Unicode" by Microsoft and related products. – deceze Dec 23 '13 at 16:25
  • That's because it used to be called Unicode before it switched to 32-bit code points. Microsoft adopted it before the standard was set, and a lot of the functions and documentation have this original name. – codekaizen Dec 13 '17 at 20:06
  • @codekaizen Unicode is limited to a little over 20 bits; its own backward compatibility requirements prevent it from growing over this size, because they decided to support UTF-16, which is incapable of encoding anything over 0x10FFFF. Support for this primitive UTF-16 encoding is why Unicode will never assign a character in the UTF-16 surrogate range of U+D800/U+DFFF. Why Unicode would voluntarily choose to restrict itself from future growth is beyond me, and in my opinion indicates a likely lack of intelligence on their part. – user3338098 Nov 11 '20 at 23:08