3

How to distinguish UTF-8 (no BOM) and ASCII files?

user536232

2 Answers

5

If the file contains any bytes with the top bit set (values 0x80 through 0xFF), then it is not ASCII.

So if the only possibilities are ASCII or UTF-8, then it's UTF-8.

If the file contains only bytes with the top bit clear, then it's meaningless to distinguish whether it's ASCII or UTF-8, since it represents exactly the same series of characters either way. But you can call it ASCII.

Of course, this doesn't distinguish UTF-8 from an ISO Latin encoding or CP1252, nor does it confirm that the so-called UTF-8 is actually valid.
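As a rough illustration of the rule above, here is a minimal Python sketch. The function name `classify` and its three return labels are made up for this example. It also adds a strict decode step, which goes slightly beyond the plain top-bit check, so that invalid "so-called UTF-8" is reported separately instead of being assumed.

```python
def classify(data: bytes) -> str:
    """Return 'ascii', 'utf-8', or 'unknown' for a buffer of raw bytes.

    Hypothetical helper for illustration only.
    """
    # No byte has the top bit set: the data is ASCII (and also valid UTF-8).
    if all(b < 0x80 for b in data):
        return "ascii"
    # At least one high byte, so it cannot be ASCII.
    # Check that it really decodes as UTF-8 before calling it that.
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return "unknown"  # could be ISO Latin, CP1252, or binary data
    return "utf-8"
```

To use it on a file, read the file in binary mode (`open(path, "rb").read()`) and pass the bytes in. If ASCII and UTF-8 really are the only possibilities, the `"unknown"` branch should never be hit.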

Steve Jessop
  • In the case where the file contains no high bytes, calling it "ASCII" might be worthwhile - for example when giving it a MIME type. This will ensure that broken legacy mail systems which might not know what "UTF-8" means will still accept plain ASCII transmissions. ;-) – R.. GitHub STOP HELPING ICE May 02 '11 at 01:50
  • Also note that if you do confirm that the file parses as valid UTF-8, this gives you a high degree of certainty that the file actually was intended to be interpreted as UTF-8. The nature of UTF-8 multibyte sequences makes them almost-certainly nonsense when interpreted as legacy codepage data. – R.. GitHub STOP HELPING ICE May 02 '11 at 01:52
  • Yes, that's what I meant - you can call it ASCII, whereas if any high bits are set then you *can't* call it ASCII. If no high bits are set, then what's meaningless would be to say that it's ASCII *as opposed to* UTF-8 - whether it was originally intended to be UTF-8 or not, in fact it is now and can be treated as such, including running it through your UTF-8 decoder. I probably wasn't very clear. – Steve Jessop May 02 '11 at 11:35
  • You were clear, I was just adding some info on why it may be useful to call it "ASCII" when it's actually (of course) both ASCII and UTF-8. – R.. GitHub STOP HELPING ICE May 02 '11 at 11:38
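To illustrate the point in the comments about validity implying intent, here is a small Python sketch. The helper name `is_valid_utf8` is hypothetical. It shows that typical CP1252 text with an accented letter fails strict UTF-8 decoding, while real UTF-8 read as CP1252 turns into nonsense.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as strict UTF-8 (illustrative helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# "café" saved as CP1252 ends in the lone byte 0xE9, which is not valid UTF-8.
legacy = "caf\u00e9".encode("cp1252")
print(is_valid_utf8(legacy))

# The UTF-8 encoding of "é" is the two bytes 0xC3 0xA9; read as CP1252 they
# become two unrelated characters, which is the familiar mojibake.
mojibake = "\u00e9".encode("utf-8").decode("cp1252")
print(mojibake)
```

This is a heuristic and not proof: a legacy-encoded file can, by coincidence, also be valid UTF-8, but that is unlikely for real text.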
-1

http://msdn.microsoft.com/en-us/library/dd318672%28v=vs.85%29.aspx

The IsTextUnicode function determines if a buffer is likely to contain a form of Unicode text.

user536232