6

I am trying to write code to compare two strings. On Windows I can use strcmp, but I want to write it for multibyte character strings so that it is compatible with all other platforms. Can I use memcmp? If not, is there another API I can use, or do I need to write my own?

Suri

3 Answers

5

You have to be careful. I'm not an expert on Unicode/multi byte encodings, but I know that with diacritics sometimes two strings can be considered equal when their bytes are not exactly the same. It's recommended to use pre-tested APIs, because string encodings can get pretty messy.

See The Old New Thing on case mapping. I can't think of a reference for the diacritics, but if I do I'll post it.
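To make the diacritics point concrete, here is a minimal sketch in C. It assumes both strings are UTF-8; the names `E_ACUTE_PRECOMPOSED`, `E_ACUTE_DECOMPOSED`, and `bytes_equal` are made up for illustration. Both buffers render as "é", yet their bytes differ, so a byte-wise comparison reports them as unequal.

```c
#include <stddef.h>
#include <string.h>

/* "é" as a single precomposed code point, U+00E9, encoded in UTF-8. */
static const char E_ACUTE_PRECOMPOSED[] = "\xC3\xA9";

/* "é" as "e" (U+0065) followed by COMBINING ACUTE ACCENT (U+0301), in UTF-8. */
static const char E_ACUTE_DECOMPOSED[] = "e\xCC\x81";

/* Hypothetical helper: true only if both buffers hold exactly the same bytes. */
static int bytes_equal(const char *a, size_t a_len,
                       const char *b, size_t b_len)
{
    return a_len == b_len && memcmp(a, b, a_len) == 0;
}
```

Calling `bytes_equal` on the two buffers (with their `strlen` lengths, 2 and 3 bytes) returns 0. Treating them as the same text would require normalizing both to one form first, which is the kind of job a library such as ICU handles.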

parkovski
  • This is correct. For some cases, a `memcmp` will work. For 100% correctness, and especially if Unicode in any form is involved, `memcmp` will not work. Even simple characters like `é` can be represented more than one way: either as `é` (one Unicode code point, U+00E9), or as `e` followed by a combining acute accent (two code points, U+0065 U+0301). Most of the time, these don't get mixed and matched, so you might not see any problems at first, but eventually it will bite you. – StilesCrisis Feb 27 '12 at 06:38
  • Another way in which strings could be 'considered' equal, but not byte-equal is if your comparison is case invariant. In this case you need to perform what is termed case folding, which allows comparison of upper case, lower case, title case, and case invariant glyphs (which, as stated above could be in memory represented as multiple code points... or not). – Bingo Feb 27 '12 at 06:44
  • Equal after normalization is not the same thing as equal. That's the whole point of normalization. OP was asking whether two strings are equal, not whether they are equivalent. – Ted Hopp Feb 27 '12 at 06:57
  • @Bingo: Case handling is worse. In Turkish the upper case of `i` is not `I`, it's `İ` (`I` with a dot above it) and the lower case of `I` isn't `i`, it's `ı` (dotless `i`), in which case you need to know the language in which a word is written. :) – Alexey Frunze Feb 27 '12 at 07:39
  • Here's a reference on the various Unicode normalization types (various ways that a character can be encoded). http://unicode.org/reports/tr15/#Introduction Note that UTF8 specifically requires the shortest-possible encoding for characters, but this is specific to UTF8, AFAIK--other types of Unicode are more lenient. – StilesCrisis Feb 27 '12 at 14:41
2

If the two strings use the same encoding, you can use memcmp. If they use UTF-8 and don't contain the NUL character (U+0000), you could even use strcmp, since the byte 0x00 appears in UTF-8 only as the encoding of U+0000 itself. Another option is to convert your strings to wide characters using mbstowcs.
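As a rough sketch of the difference between the two calls, assuming both buffers are already in the same encoding and their lengths are known: `same_encoding_equal`, `WITH_NUL_A`, and `WITH_NUL_B` are hypothetical names, not part of any library.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: compares two buffers assumed to hold text in the
 * same encoding. Lengths are explicit, so embedded U+0000 bytes are
 * compared like any other byte. */
static int same_encoding_equal(const char *a, size_t a_len,
                               const char *b, size_t b_len)
{
    if (a_len != b_len)
        return 0;
    return memcmp(a, b, a_len) == 0;
}

/* Two 3-byte buffers that differ only after an embedded NUL (U+0000). */
static const char WITH_NUL_A[3] = { 'a', '\0', 'b' };
static const char WITH_NUL_B[3] = { 'a', '\0', 'c' };
```

Here `strcmp(WITH_NUL_A, WITH_NUL_B)` returns 0 because it stops at the first NUL byte, while `same_encoding_equal(WITH_NUL_A, 3, WITH_NUL_B, 3)` returns 0 (not equal) because it looks at all three bytes. If your strings can never contain U+0000, the two approaches agree.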

Ted Hopp
  • This will have false negatives--two identical strings can be encoded into different byte patterns. You need to compare with a Unicode savvy function. – StilesCrisis Feb 27 '12 at 06:40
  • @StilesCrisis - Can you provide an example of how identical strings can have different UTF-8 encodings? Or, for that matter, how this could happen with any other single encoding (like ISO 8859-1)? I did make the point that the strings needed to be using the same encoding. – Ted Hopp Feb 27 '12 at 06:56
  • 1
    @Ted Hopp: With UTF-8, a character can be encoded in overlong form (a sequence that decodes to a value that should use a shorter sequence; this wording is from Wikipedia). In that case, memcmp returns the wrong answer, but a UTF-8-aware compare function returns the right one... – Malkocoglu Feb 27 '12 at 07:50
  • 2
    @Malkocoglu - As of Unicode version 3.0, the standard forbids the generation of non-shortest form UTF-8 sequences. (It's conformance clause C12 in the standard.) A string encoded with an overlong form is not using legal UTF-8 encoding. (The same Wikipedia page lists "overlong form" under the section [Invalid byte sequences](http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences).) – Ted Hopp Feb 27 '12 at 08:06
  • @Ted Hopp: If you use memcmp/strcmp on ill-formed UTF-8 strings, they will return a result as if the sequences were valid. If you use a UTF-8-aware compare function, it must return an error if either of the strings is ill-formed. This was my point; I am against ill-formed UTF-8 too... – Malkocoglu Feb 27 '12 at 09:21
  • "0 does not appear in UTF-8 encoded strings." This is wrong. The UTF-8 encoding of the code point 0 is 0x00 (one byte). – Sebastian Ullrich Jun 18 '18 at 11:39
  • @SebastianUllrich - Good point. I had overlooked that. I'll update my answer. – Ted Hopp Jun 18 '18 at 13:04
1

If the strings both use the same encoding, memcmp will work fine. Keep in mind that wide characters are different sizes on different platforms, however.
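A small sketch of what the size caveat means in practice, with hypothetical helper names `wide_bytes` and `wide_equal`: the byte count passed to memcmp has to be derived from `sizeof(wchar_t)`, which is commonly 2 on Windows and 4 on Linux with glibc, so wide buffers written on one platform are not byte-compatible with another.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Number of bytes memcmp must cover to compare n_chars wide characters
 * on the platform this is compiled for. */
static size_t wide_bytes(size_t n_chars)
{
    return n_chars * sizeof(wchar_t);
}

/* Hypothetical helper: byte-wise comparison of two wide-character buffers
 * of equal length, both produced on this same platform. */
static int wide_equal(const wchar_t *a, const wchar_t *b, size_t n_chars)
{
    return memcmp(a, b, wide_bytes(n_chars)) == 0;
}
```

This only holds for buffers created within one program on one platform; it says nothing about wide strings exchanged between systems.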

If the strings use different encodings, you will need a library such as ICU to deal with it.

Collin Dauphinee