How to distinguish UTF-8 (no BOM) and ASCII files?
3
-
http://stackoverflow.com/questions/1031645/how-to-detect-utf-8-in-plain-c – Anders Lindahl Apr 29 '11 at 10:21
-
All ASCII files are also UTF-8 files. :) – tchrist Apr 29 '11 at 13:40
-
Duplicate: http://stackoverflow.com/questions/4907942/detecting-text-file-type-ansi-vs-utf-8 – Marjan Venema Apr 30 '11 at 19:35
2 Answers
5
If the file contains any bytes with the top bit set, then it is not ASCII.
So if the only possibilities are ASCII or UTF-8, then it's UTF-8.
If the file contains only bytes with the top bit clear, then it's meaningless to distinguish whether it's ASCII or UTF-8, since it represents exactly the same series of characters either way. But you can call it ASCII.
Of course, this doesn't distinguish UTF-8 from ISO 8859-1 (Latin-1) or CP1252, nor does it confirm that the supposed UTF-8 is actually valid.
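
As a rough illustration of the checks described above, here is a minimal Python sketch. The function name `classify` and its three return labels are made up for this example, and it assumes the whole file fits in memory. Beyond the top-bit test, it also runs a strict UTF-8 decode, so invalid input is reported as "unknown" rather than assumed to be UTF-8.

```python
def classify(data: bytes) -> str:
    """Return 'ascii', 'utf-8', or 'unknown' for a buffer of raw bytes.

    Hypothetical helper for illustration only; the labels are arbitrary.
    """
    # No byte has the top bit set: the data is ASCII, and therefore
    # also valid UTF-8 representing exactly the same characters.
    if all(b < 0x80 for b in data):
        return "ascii"
    # At least one high byte is present, so it cannot be ASCII.
    # A strict decode checks that the bytes form well-formed UTF-8.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        # Possibly Latin-1, CP1252, or binary data; this sketch cannot tell.
        return "unknown"
    return "utf-8"
```

For a file, something like `classify(open(path, "rb").read())` would apply it to the raw bytes; opening in binary mode matters so that no decoding happens before the check.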

Steve Jessop
-
In the case where the file contains no high bytes, calling it "ASCII" might be worthwhile - for example when giving it a MIME type. This will ensure that broken legacy mail systems which might not know what "UTF-8" means will still accept plain ASCII transmissions. ;-) – R.. GitHub STOP HELPING ICE May 02 '11 at 01:50
-
Also note that if you do confirm that the file parses as valid UTF-8, this gives you a high degree of certainty that the file actually was intended to be interpreted as UTF-8. The nature of UTF-8 multibyte sequences makes them almost-certainly nonsense when interpreted as legacy codepage data. – R.. GitHub STOP HELPING ICE May 02 '11 at 01:52
-
Yes, that's what I meant - you can call it ASCII, whereas if any high bits are set then you *can't* call it ASCII. If no high bits are set, then what's meaningless would be to say that it's ASCII *as opposed to* UTF-8 - whether it was originally intended to be UTF-8 or not, in fact it is now and can be treated as such, including running it through your UTF-8 decoder. I probably wasn't very clear. – Steve Jessop May 02 '11 at 11:35
-
You were clear, I was just adding some info on why it may be useful to call it "ASCII" when it's actually (of course) both ASCII and UTF-8. – R.. GitHub STOP HELPING ICE May 02 '11 at 11:38
-1
http://msdn.microsoft.com/en-us/library/dd318672%28v=vs.85%29.aspx
IsTextUnicode function: determines whether a buffer is likely to contain a form of Unicode text.

user536232