How to distinguish UTF-8 and ASCII files?

Question

How can I distinguish UTF-8 (no BOM) files from ASCII files?

http://stackoverflow.com/questions/1031645/how-to-detect-utf-8-in-plain-c — Anders Lindahl, Apr 29 '11 at 10:21
Duplicate: http://stackoverflow.com/questions/4907942/detecting-text-file-type-ansi-vs-utf-8 — Marjan Venema, Apr 30 '11 at 19:35

score 5 · Answer 1 · answered Apr 29 '11 at 13:21

If the file contains any bytes with the top bit set, then it is not ASCII.

So if the only possibilities are ASCII or UTF-8, then it's UTF-8.

If the file contains only bytes with the top bit clear, then it's meaningless to distinguish whether it's ASCII or UTF-8, since it represents exactly the same series of characters either way. But you can call it ASCII.

Of course, this doesn't distinguish UTF-8 from ISO Latin encodings (ISO 8859) or CP1252, nor does it confirm that the so-called UTF-8 is actually valid.
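
As a rough illustration of the rule above, here is a minimal Python sketch. The function name classify and the "other" label are assumptions made for this example, not part of the original answer. It goes one step further than the top-bit test by also checking that the bytes decode as valid UTF-8, which covers the caveat about ISO Latin or CP1252 data.

```python
def classify(data: bytes) -> str:
    """Classify raw bytes as 'ascii', 'utf-8', or 'other' (hypothetical labels)."""
    # No byte has the top bit set: the data is ASCII, and therefore
    # also valid UTF-8 representing exactly the same characters.
    if all(b < 0x80 for b in data):
        return "ascii"
    # At least one high byte: it cannot be ASCII. Accept it as UTF-8
    # only if a strict decode succeeds.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "other"  # e.g. ISO Latin, CP1252, or binary data
    return "utf-8"


if __name__ == "__main__":
    print(classify(b"hello"))                    # ascii
    print(classify("héllo".encode("utf-8")))     # utf-8
    print(classify("héllo".encode("latin-1")))   # other
```

To apply this to a file, read it in binary mode (open(path, "rb")) and pass the bytes to the function.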

Steve Jessop

In the case where the file contains no high bytes, calling it "ASCII" might be worthwhile - for example when giving it a MIME type. This will ensure that broken legacy mail systems which might not know what "UTF-8" means will still accept plain ASCII transmissions. ;-) – R.. GitHub STOP HELPING ICE May 02 '11 at 01:50
Also note that if you do confirm that the file parses as valid UTF-8, this gives you a high degree of certainty that the file actually was intended to be interpreted as UTF-8. The nature of UTF-8 multibyte sequences makes them almost-certainly nonsense when interpreted as legacy codepage data. – R.. GitHub STOP HELPING ICE May 02 '11 at 01:52
Yes, that's what I meant - you can call it ASCII, whereas if any high bits are set then you *can't* call it ASCII. If no high bits are set, then what's meaningless would be to say that it's ASCII *as opposed to* UTF-8 - whether it was originally intended to be UTF-8 or not, in fact it is now and can be treated as such, including running it through your UTF-8 decoder. I probably wasn't very clear. – Steve Jessop May 02 '11 at 11:35
You were clear, I was just adding some info on why it may be useful to call it "ASCII" when it's actually (of course) both ASCII and UTF-8. – R.. GitHub STOP HELPING ICE May 02 '11 at 11:38
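
The point in the comments about UTF-8 sequences looking like nonsense in a legacy codepage can be seen with a small Python sketch. The sample word and the helper name is_valid_utf8 are assumptions chosen for illustration, and CP1252 stands in for "legacy codepage".

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as UTF-8 without errors."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


text = "café"
utf8_bytes = text.encode("utf-8")      # b'caf\xc3\xa9'
legacy_bytes = text.encode("cp1252")   # b'caf\xe9'

# Valid UTF-8 read as CP1252 comes out as an unlikely character pair.
mojibake = utf8_bytes.decode("cp1252")

if __name__ == "__main__":
    print(mojibake)                     # cafÃ©
    print(is_valid_utf8(utf8_bytes))    # True
    print(is_valid_utf8(legacy_bytes))  # False
```

Going the other way, the CP1252 bytes fail the UTF-8 check because 0xE9 announces a three-byte sequence that never arrives. This is why a successful UTF-8 parse of data containing high bytes is strong evidence that UTF-8 was intended.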

score -1 · Accepted Answer · answered Apr 30 '11 at 03:28

http://msdn.microsoft.com/en-us/library/dd318672%28v=vs.85%29.aspx

IsTextUnicode function: determines if a buffer is likely to contain a form of Unicode text.

user536232
