I want to compare strings in Persian (UTF-8). I know I must use something like L"گل", stored in a wchar_t * or a wstring. The problem is that when I compare the strings with the compare() function, I don't get the right result.
- Do you have C++11 (e.g. GCC 4.6)? – Kerrek SB Aug 21 '11 at 21:58
- Do you mean compare for equality, or compare for the purpose of sorting, or just what? – Karl Knechtel Aug 21 '11 at 21:58
- Compare for equality, actually. – aliakbarian Aug 21 '11 at 22:03
- And I am working on Windows XP with Visual Studio 2008. – aliakbarian Aug 21 '11 at 22:04
3 Answers
wchar_t is not for UTF-8, but (depending on the platform) typically either UTF-16 or UTF-32 (UCS-4). If you want to work with UTF-8, use plain old char * or string, and their comparison functions for equality. If you want human-meaningful sorting, it gets much more involved (no matter which encoding you use).
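For the equality case asked about, here is a minimal sketch, assuming both strings are already valid UTF-8 and in the same normalization form. The helper name utf8_equal and the variables gol and gaf are hypothetical; the text is written as byte escapes so the example does not depend on how the source file is saved.

```cpp
#include <string>

// Hypothetical helper: byte-wise equality of two UTF-8 strings.
// Assumes both inputs are valid UTF-8 in the same normalization form.
bool utf8_equal(const std::string& a, const std::string& b) {
    return a == b;  // std::string compares the underlying bytes
}

// "گل" spelled out as UTF-8 bytes: U+06AF -> DA AF, U+0644 -> D9 84.
const std::string gol = "\xDA\xAF\xD9\x84";
// "گ" alone, for a non-matching comparison.
const std::string gaf = "\xDA\xAF";
```

Here utf8_equal(gol, gol) is true and utf8_equal(gol, gaf) is false. This works on Visual Studio 2008 as well, since it needs nothing beyond std::string.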

- String.Compare operates on two Strings, and String has no constructor from wchar, so most likely you are mistakenly constructing it from your wchar as if it were char and hitting a null terminator early, which is why your compare fails. If you work with UTF-8 you can store everything as char and everything should work fine, EXCEPT that "greater than" and "less than" will give you problems, but you may have had those problems with wchar as well... – Soren Aug 21 '11 at 22:36
- Note that **any** Unicode encoding, including UTF-8, 16 or 32 cannot be compared byte-wise for anything other than byte-equality. The display may be identical, but the bytes used (such as R->L markers, multi-codepoint display modifiers, and similar used in non-English languages such as Persian) will not be. – Yann Ramin Aug 21 '11 at 22:53
- @Yann Ramin: That's why the Unicode collation algorithm handles normalization and default ignorables. I often get myself a collator object with the right strength levels set and then call its equality method so I don't have to worry about Unicode's funny ideas of equal inequalities or inequal equalities or such. – tchrist Aug 22 '11 at 03:54
Unicode is notoriously difficult to compare.
Note that text in any Unicode encoding, including UTF-8, UTF-16 and UTF-32, cannot be compared byte-wise for anything other than byte-equality. Two strings may display identically while the underlying bytes differ, for example because of right-to-left marks, combining sequences, or other display modifiers that are common in languages such as Persian.
Generally, you need to normalize Unicode before you can make a realistic comparison if the meaning of the text has any significance.
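To make the point concrete, here is a small sketch of two spellings of the same letter that render alike but differ byte for byte. The letter آ can be stored as the single code point U+0622 or as U+0627 followed by the combining mark U+0653, and the two are canonically equivalent. The names below are hypothetical, and the comparison deliberately does no normalization.

```cpp
#include <string>

// آ as one precomposed code point U+0622 (UTF-8: D8 A2).
const std::string alef_madda_composed = "\xD8\xA2";
// The same letter as U+0627 ALEF + U+0653 MADDAH ABOVE (UTF-8: D8 A7 D9 93).
const std::string alef_madda_decomposed = "\xD8\xA7\xD9\x93";

// Byte-wise comparison only; performs no Unicode normalization.
bool same_bytes(const std::string& a, const std::string& b) {
    return a == b;
}
```

same_bytes(alef_madda_composed, alef_madda_decomposed) is false even though a reader sees the same letter. Treating them as equal requires normalizing both strings to one form first, which the standard library does not do; a Unicode library such as ICU is the usual route.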

- _Text_ is notoriously difficult to compare. ASCII cheats by ignoring 95% of all text in the world. – MSalters Aug 22 '11 at 08:46
If the strings that you want to compare are in a specific, definite encoding already, then don't use wchar_t and don't use L"" literals -- those are not for Unicode, but for implementation-defined, opaque encodings only.
If your strings are in UTF-8, use a string of chars. If you want to convert them to raw Unicode codepoints (UCS-4/UTF-32), or if you already have them in that form, store them in a string of uint32_ts, or char32_ts if you have a modern compiler.
If you have C++11, your literal can be char str8[] = u8"گل"; or char32_t str32[] = U"گل";. See this topic for some more on this.
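As an illustration of going from UTF-8 to raw code points, here is a rough sketch of a decoder that fills a std::u32string. It assumes a C++11 compiler (for char32_t and std::u32string), so it will not build on Visual Studio 2008. The function name decode_utf8 is hypothetical, and the validation is deliberately minimal, so treat it as a starting point and not as a conforming decoder.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into UTF-32 code points.
// Sketch only: invalid lead bytes become U+FFFD, a truncated trailing
// sequence ends decoding with U+FFFD, and continuation bytes, overlong
// forms and surrogates are not validated.
std::u32string decode_utf8(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)              { cp = lead; }
        else if ((lead >> 5) == 0x06) { cp = lead & 0x1F; extra = 1; }
        else if ((lead >> 4) == 0x0E) { cp = lead & 0x0F; extra = 2; }
        else if ((lead >> 3) == 0x1E) { cp = lead & 0x07; extra = 3; }
        else { out.push_back(0xFFFD); ++i; continue; }

        if (i + extra >= in.size()) {  // sequence runs past the end
            out.push_back(0xFFFD);
            break;
        }
        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 1; k <= extra; ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

With this, decode_utf8("\xDA\xAF\xD9\x84") yields the two code points U+06AF and U+0644, which compares equal to std::u32string(U"\u06AF\u0644"). Comparing code points has the same limits as comparing bytes: it tests identity of the sequence, not equivalence of the text.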
If you want to interact with command line arguments or the environment, use iconv() to convert from WCHAR_T to UTF-32 or UTF-8.