7

A problem with various character encodings is that files are not always clearly marked with the encoding they use. There are inconsistent conventions for marking some of them with "byte-order marks" (BOMs). But in essence you have to be told what a file's encoding is in order to read it accurately.

We build programming tools that read source files, and this gives us grief. We have means to specify defaults, we sniff for BOMs, etc., and we do pretty well with conventions and defaults. But one place we (and, I assume, everybody else) get hung up is UTF-8 files that have no BOM.

Recent MS IDEs (e.g., Visual Studio 2010) will apparently "sniff" a file to determine whether it is UTF-8 encoded without a BOM. (Being in the tools business, we'd like to be compatible with MS because of their market share, even if it means having to go over the "stupid" cliff with them.) I'm specifically interested in what they use as a heuristic (although general discussion of heuristics is fine). How can it be "right"? (Consider an ISO 8859-x encoded string interpreted this way.)

EDIT: This paper on detecting character encodings/sets is pretty interesting: http://www-archive.mozilla.org/projects/intl/UniversalCharsetDetection.html

EDIT December 2012: We ended up scanning the entire file to see whether it contains any violations of UTF-8 sequences... and if it does not, we call it UTF-8. The bad part of this solution is that you have to process the characters twice if the file is UTF-8. (If it isn't UTF-8, this test is likely to determine that fairly quickly, unless the file happens to be all 7-bit ASCII, at which point reading it as UTF-8 won't hurt.)
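
A minimal sketch of that whole-file check, for reference (the function name is ours, and it only checks lead/continuation-byte structure, not overlong forms or surrogates, so it is a heuristic in the same spirit as the comments below):

#include <stddef.h>

/* Returns 1 if buf[0..len) is structurally valid UTF-8, 0 otherwise.
   Pure 7-bit ASCII passes trivially, as noted above. */
static int buffer_is_valid_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char b = buf[i];
        size_t extra;                            /* continuation bytes expected */

        if (b < 0x80)                extra = 0;  /* ASCII            */
        else if ((b & 0xE0) == 0xC0) extra = 1;  /* 2-byte sequence  */
        else if ((b & 0xF0) == 0xE0) extra = 2;  /* 3-byte sequence  */
        else if ((b & 0xF8) == 0xF0) extra = 3;  /* 4-byte sequence  */
        else return 0;               /* stray continuation or invalid lead byte */

        if (extra > 0 && i + extra >= len)
            return 0;                /* sequence truncated at end of buffer */
        for (size_t k = 1; k <= extra; k++)
            if ((buf[i + k] & 0xC0) != 0x80)
                return 0;            /* continuation byte expected */
        i += extra + 1;
    }
    return 1;
}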

pnuts
Ira Baxter
  • Attempting to read the entire file as UTF-8 would either be "successful" or encounter invalid byte sequences. At some point I thought I saw an article talking about the likelihood of false positives, but I cannot relocate it. –  Jul 13 '12 at 22:37
  • This answer (http://stackoverflow.com/a/4522251/120163) claims some tiny "false-positive" rate for pretty short character sequences. I'm trying to decide if I understand/believe it. – Ira Baxter Jul 13 '12 at 22:57
  • ... the above answer seems to assume a random, flat distribution of characters drawn from the Unicode set, which I highly suspect is wrong, so I conclude the argument for tiny false-positive rates is wrong. (The rate may still be tiny.) – Ira Baxter Jul 16 '12 at 07:10
  • 2
    No ISO-8859-x file that ever has one non-ASCII character surrounded by ASCII will ever be valid UTF-8. Most two-byte non-ASCII sequences aren't valid UTF-8. There are a few examples of real-life strings that could get misinterpreted as UTF-8, but it would be somewhat unlikely for a whole file to only have those strings. – prosfilaes Jul 16 '12 at 07:12
  • If you can process the whole file, why not check it for valid UTF-8 encoding? If it passes, it most likely really is UTF-8. – Nickolay Olshevsky Dec 05 '12 at 22:11
  • @Nickolay: That's what we ended up doing. I'm not happy about it, because you might have to read a couple of million characters, just so you can go back and read the couple of million characters again. That seems pretty pointless. Yes, I know about buffering. :-} – Ira Baxter Dec 05 '12 at 23:29
  • You can read it once, check for compatibility with UTF-8 and UTF-16 (BE/LE), and fill the frequency tables for the 1-byte encodings you'd like to support :) – Nickolay Olshevsky Dec 06 '12 at 08:54

3 Answers

8

If the encoding is UTF-8, the first byte you see over 0x7F must be the start of a UTF-8 multi-byte sequence. So test for that. Here is the code we use:

typedef unsigned char unc;   /* unsigned byte; left undefined in the original snippet */

/* Returns the length (2, 3, or 4) of the UTF-8 sequence whose lead byte is at
   cpt, or 0 if the bytes do not form a valid multi-byte sequence. The caller
   must ensure enough bytes follow cpt for the lead byte's indicated length. */
unc IsUTF8(unc *cpt)
{
    if (!cpt)
        return 0;

    if ((*cpt & 0xF8) == 0xF0) { // start of 4-byte sequence
        if (((*(cpt + 1) & 0xC0) == 0x80)
         && ((*(cpt + 2) & 0xC0) == 0x80)
         && ((*(cpt + 3) & 0xC0) == 0x80))
            return 4;
    }
    else if ((*cpt & 0xF0) == 0xE0) { // start of 3-byte sequence
        if (((*(cpt + 1) & 0xC0) == 0x80)
         && ((*(cpt + 2) & 0xC0) == 0x80))
            return 3;
    }
    else if ((*cpt & 0xE0) == 0xC0) { // start of 2-byte sequence
        if ((*(cpt + 1) & 0xC0) == 0x80)
            return 2;
    }
    return 0;
}

If you get a return of 0, it is not valid UTF-8. Otherwise, skip the number of bytes returned and continue checking at the next byte over 0x7F.
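
A minimal driver sketch for the function above (the names ScanBufferForUTF8, buf, and len are ours, not from the answer; the 4-byte headroom check is deliberately conservative because IsUTF8() takes no length argument):

#include <stddef.h>

/* Scan an in-memory buffer using IsUTF8(). Returns 1 if every byte over
   0x7F starts a well-formed multi-byte sequence, 0 otherwise. */
int ScanBufferForUTF8(unc *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        if (buf[i] < 0x80) {       /* plain 7-bit ASCII, nothing to check */
            i++;
            continue;
        }
        if (i + 4 > len)           /* conservatively require 4 bytes of   */
            return 0;              /* headroom so IsUTF8() reads safely   */
        unc n = IsUTF8(buf + i);
        if (n == 0)
            return 0;              /* invalid lead or continuation byte */
        i += n;
    }
    return 1;
}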

McDowell
Jeremy Griffith
  • We did something essentially equivalent to this. Thanks for the detailed response, though. – Ira Baxter Feb 16 '13 at 18:21
  • 1
    I'd add a fourth case for plain ASCII: `else if ((*cpt & 0x80) == 0x00) return 1;` – wildplasser Feb 16 '13 at 18:25
  • You didn't answer the question as to what MS does, but I suspect I'm unlikely to get an answer. You did provide a simple mechanism for checking. I don't think it's complete, because it will accept some non-Unicode sequences (not all combinations are valid), but it's pretty good as a heuristic. So, I'm giving your answer the benefit. – Ira Baxter Mar 04 '13 at 18:38
  • @Jeremy Griffith I converted this to a Java method, [isUTF, done this way](http://pastebin.com/c27wfZYK) ==> As a result, the part `if ((buffer[0] & 0xF8) == 0xF0) {` doesn't work (and the current file definitely has the right encoding). Why does this happen? What is wrong? How do I solve this problem? – catch23 Mar 07 '13 at 21:58
2

Visual Studio Code uses jschardet, which returns a guess and a confidence level. It's all open source, so you can inspect the code.

https://github.com/microsoft/vscode/issues/101930#issuecomment-655565813

jedmao
1

We just found a solution to this. Basically, when you don't know the encoding of a file/stream/source, you need to check the entire file, and/or look for portions of text, to see whether you get UTF-8 matches. I see this as similar to what some antivirus products do: checking for portions of known viral substrings.

I'd suggest you call a function similar to the one we wrote whenever you read the file/stream, line by line, to determine whether UTF-8 encoding is present or not.

Please refer to our post below

Ref. - https://stackoverflow.com/questions/17283872/how-to-detect-utf-8-based-encoded-strings

  • You didn't read my EDIT Dec 2012 note carefully. That's exactly what I said, and we did. You don't get to process portions; you get to process the whole thing to decide. (What does it mean to read it line-by-line if you haven't determined the encoding yet?) – Ira Baxter Jun 24 '13 at 21:26
  • Good, you did the same as we did and what we explain in our post. The reason for reading by portions depends on the usage; i.e., if I am making a scraper and I have to display portions of what I scrape in a listview, I don't need to detect over the whole scraped HTML I get, just the portion of text I want to display in a grid/control. The need for a function like ours is because you can't UTF-8 decode something that's already been UTF-8 decoded; i.e., DecodeUTF8("Societé") will return something like Societ¿, which is wrong. This is why you first need to detect whether the string is SocietÈ – Diego Sendra Jun 24 '13 at 23:21
  • Also, we found it not reliable enough to just determine the encoding of a file, whether by reading its BOM header or by checking its UTF-8 declaration in the case of HTML; i.e., you could be reading from a database and not know whether it's UTF-8 or not, and the same applies if someone just copies/pastes text into a textarea or a textbox and your code was not specifically expecting/written to hold UTF-based data. I hope the function we posted helps others. Regards, Diego – Diego Sendra Jun 24 '13 at 23:29
  • 1
    Fundamentally you can't win this war. Since some bit strings can be valid UTF-8 as well as EBCDIC, the only way you can actually know is to be told. There are two ways to be told: 1) metadata outside the text container (easily and often lost), and 2) metadata marking the file (BOMs, etc., or file attributes). But people seem to hate BOM-style markers. What's left is chaos, which is what we have, and well deserved as a community. IQ appears not to be additive. – Ira Baxter Jun 24 '13 at 23:53
  • So what do you suggest? Because, eventually, we all have to deal with a string whose source "we don't know" and/or whose encoding we are not told. Asking myself this question is how I came up with this function. I know it effectively detects punctuation marks and symbols in most common languages; I can't guarantee it detects other really uncommon symbols, but áéíóú, basic àèìòù, äëïöü and others, as well as symbols, are detected by my function. I get your point, but there's something we have to do, or at least some workaround to rely on. – Diego Sendra Jun 25 '13 at 00:34
  • "What's left is chaos, which is what we have, and well-deserved as a community" -> This sounds dramatic, but so true. We lack standards, most things are non-standardized. I am saying this since 20 years now, not only UTF-8 related. – Diego Sendra Jun 25 '13 at 00:37
  • Just think of this as full employment for software engineers, patching what people did in the past, and making a bigger heap to patch. I worry that all the patches eventually become fractal and then we make no progress in the limit. Anyway, +1 for chiming in. – Ira Baxter Jun 25 '13 at 00:42