2

Given an untyped pointer to a buffer that can hold either an ANSI or a Unicode string, how do I tell whether the string it currently holds is multibyte or not?

Deduplicator
forsajt

3 Answers

9

Unless the string itself contains information about its format (e.g. a header or a byte order mark), there is no foolproof way to detect whether a string is ANSI or Unicode. The Windows API includes a function called IsTextUnicode() that essentially guesses whether a buffer holds ANSI or Unicode text, but then you run into the problems that come with guessing, because a guess is all it can make.
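For illustration, here is a minimal sketch (plain C, assuming a Win32 build linked against Advapi32) of what calling that API looks like; the flag selection is an assumption, and the result is still only a guess:

```c
#include <windows.h>
#include <stdio.h>

/* Ask IsTextUnicode() for its best guess about a raw buffer.
   This is a heuristic: it can and does misclassify short or
   unusual buffers. */
static BOOL looks_like_utf16(const void *buf, int size_in_bytes)
{
    INT tests = IS_TEXT_UNICODE_STATISTICS | IS_TEXT_UNICODE_NULL_BYTES;
    /* On return, `tests` holds the subset of the requested tests
       that actually passed. */
    return IsTextUnicode(buf, size_in_bytes, &tests);
}

int main(void)
{
    const wchar_t wide[]   = L"hello, world";
    const char    narrow[] = "hello, world";

    printf("wide buffer:   %d\n", looks_like_utf16(wide, (int)sizeof(wide)));
    printf("narrow buffer: %d\n", looks_like_utf16(narrow, (int)sizeof(narrow)));
    return 0;
}
```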

Why do you have an untyped pointer to a string in the first place? You must know exactly what your data represents and how, either by using a typed pointer to begin with or by carrying an ANSI/Unicode flag alongside the buffer. A string of bytes is meaningless unless you know exactly what it represents.

In silico
5

Unicode is not an encoding; it's a mapping of code points to characters. UTF8 and UCS2, for example, are encodings.

And, given that there is zero difference between ASCII and UTF8 encoding if you restrict yourself to the lower 128 characters, you can't actually tell the difference.

You'd be better off asking if there were a way to tell the difference between ASCII and a particular encoding of Unicode. And the answer to that is to use statistical analysis, with the inherent possibility of inaccuracy.

For example, if the entire string consists of bytes less than 128, it's ASCII (it could be UTF8 but there's no way to tell and no difference in that case).

If it's primarily English/Roman and consists of lots of two-byte sequences with a zero as one of the bytes, it's probably UTF16. And so on. I don't believe there's a foolproof method without actually having an indicator of some sort (e.g., BOM).
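To make that "indicator of some sort" concrete, here is a hedged sketch of a BOM check in plain C; the function name is made up for illustration, while the BOM byte sequences are the standard Unicode ones (note that a producer is never obliged to write a BOM, so the absence of one tells you nothing):

```c
#include <stddef.h>
#include <string.h>

/* Return the encoding name if the buffer starts with a known BOM,
   or NULL if no BOM is present. */
static const char *detect_bom(const unsigned char *buf, size_t len)
{
    /* Check UTF32 before UTF16: the UTF-16LE BOM (FF FE) is a
       prefix of the UTF-32LE BOM (FF FE 00 00). */
    if (len >= 4 && memcmp(buf, "\xFF\xFE\x00\x00", 4) == 0) return "UTF-32LE";
    if (len >= 4 && memcmp(buf, "\x00\x00\xFE\xFF", 4) == 0) return "UTF-32BE";
    if (len >= 3 && memcmp(buf, "\xEF\xBB\xBF", 3) == 0)     return "UTF-8";
    if (len >= 2 && memcmp(buf, "\xFF\xFE", 2) == 0)         return "UTF-16LE";
    if (len >= 2 && memcmp(buf, "\xFE\xFF", 2) == 0)         return "UTF-16BE";
    return NULL;  /* no BOM: you are back to guessing (or to a flag) */
}
```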

My suggestion is to not put yourself in the position where you have to guess. If the data type itself can't contain an indicator, provide different functions for ASCII and a particular encoding of Unicode, then force the work of deciding onto your client. At some point in the calling hierarchy, someone should know the encoding.

Or, better yet, ditch ASCII altogether, embrace the new world and use Unicode exclusively. With UTF8 encoding, ASCII has exactly no advantages over Unicode :-)

paxdiablo
  • ASCII has one advantage over UTF8 and UTF16: the number of bytes == the number of characters. For UTF8/16 you MUST iterate over the entire string to calculate the number of characters, even if you are given begin and end. – KitsuneYMG Dec 03 '10 at 00:33
  • 1
    Why would you care about the number of characters though? It's almost always useless. It has no correspondence to physical display width/columns, and depending on your editor style may not even correspond to the number of times you have to hit backspace to erase the whole string. In this way UTF-8 is a huge help: it forces you to realize and accept that any code that's working with characters and not strings is almost surely **wrong**. – R.. GitHub STOP HELPING ICE Dec 03 '10 at 00:40
  • @KitsuneYMG, you have to do that for ASCII data anyway :-) The length of the 'string' `"hello\0world"` is not a simple `end - start` calculation, it's actually `5` and you have to look at every byte to find that out. UTF8 makes it pretty easy since you can tell the difference between a start byte and a continuation byte by the bit pattern alone. – paxdiablo Dec 03 '10 at 01:07
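As a small illustration of that last comment, here is a sketch in C that counts UTF8 code points by skipping continuation bytes (it assumes the input is already valid UTF8, and the function name is just illustrative):

```c
#include <stddef.h>

/* In UTF8, continuation bytes always match the bit pattern 10xxxxxx,
   so counting code points only requires skipping those bytes --
   no lookup table and no state machine. */
size_t utf8_char_count(const unsigned char *s, size_t len)
{
    size_t count = 0;
    for (size_t i = 0; i < len; i++)
        if ((s[i] & 0xC0) != 0x80)   /* not a continuation byte */
            count++;
    return count;
}
```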
2

In general you can't.

You could check for the pattern of zeros: just one at the end probably means an ANSI 'C' string, every other byte being zero probably means ANSI-range text stored as UTF16, and three zeros per character might mean UTF32.
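A rough sketch of that zero-pattern check in plain C follows; the thresholds and names are assumptions, and like everything else on this page it is a guess, not a proof:

```c
#include <stddef.h>

enum guess { GUESS_ANSI, GUESS_UTF16, GUESS_UTF32, GUESS_UNKNOWN };

/* Guess the encoding of `buf` (length in bytes) from where the zero
   bytes fall: none, or a single trailing zero, suggests a narrow ANSI
   'C' string; roughly three zeros per four bytes suggests ASCII-range
   text as UTF32; roughly every other byte zero suggests UTF16. */
enum guess guess_by_zero_pattern(const unsigned char *buf, size_t len)
{
    size_t zeros = 0;
    for (size_t i = 0; i < len; i++)
        if (buf[i] == 0)
            zeros++;

    if (zeros == 0 || (zeros == 1 && buf[len - 1] == 0))
        return GUESS_ANSI;
    if (len >= 4 && zeros >= (len / 4) * 3)
        return GUESS_UTF32;
    if (len >= 2 && zeros >= len / 4)
        return GUESS_UTF16;
    return GUESS_UNKNOWN;
}
```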

Martin Beckett