How can I compare UTF-8 strings, such as Persian words, in C++?

Question

I want to compare strings in Persian (UTF-8). I know I must use something like L"گل", stored in a wchar_t * or wstring. The problem is that when I compare the strings with the compare() function, I don't get the right result.

Do you mean compare for equality, or compare for the purpose of sorting, or just what? — Karl Knechtel, Aug 21 '11 at 21:58

hmakholm left over Monica · Answer 1 · 2011-08-21T22:54:53.207

3

wchar_t is not for UTF-8, but (depending on the platform) typically either UTF-16 or UTF-32. If you want to work with UTF-8, use plain old char * or string, and their comparison functions for equality. If you want human-meaningful sorting, it gets much more involved (no matter which encoding you use).
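A minimal sketch of the equality case, assuming both strings are already valid UTF-8 and were produced the same way (same code points in the same order). The helper name `utf8_bytes_equal` and the constants are hypothetical, and the words are written as byte escapes so the example does not depend on the source file's encoding:

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: byte-wise equality of two UTF-8 strings.
// True only if both hold exactly the same byte sequence.
bool utf8_bytes_equal(const std::string& a, const std::string& b) {
    return a == b;
}

// "گل" is U+06AF U+0644, which encodes to DA AF D9 84 in UTF-8.
const std::string kGol = "\xDA\xAF\xD9\x84";
// The same two letters in the opposite order (U+0644 U+06AF).
const std::string kLog = "\xD9\x84\xDA\xAF";

// Small self-check showing how the helper is meant to be used.
void check_utf8_bytes_equal() {
    assert(utf8_bytes_equal(kGol, "\xDA\xAF\xD9\x84"));
    assert(!utf8_bytes_equal(kGol, kLog));
}
```

This only answers "are the bytes identical?"; it says nothing about ordering for display or about strings that look the same but are encoded differently.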

edited Aug 21 '11 at 22:54

answered Aug 21 '11 at 22:22


String.Compare operates on two Strings, and String does not have a constructor from wchar, so most likely you are constructing from your wchar as a char in error and hitting a null terminator early, which is why your compare fails -- if you work with UTF-8 you can store everything as char and everything should work fine EXCEPT that "greater than" and "less than" will give you problems, but you may have had problems with those in wchar as well... – Soren Aug 21 '11 at 22:36

Note that **any** Unicode encoding, including UTF-8, 16 or 32 cannot be compared byte-wise for anything other than byte-equality. The display may be identical, but the bytes used (such as R->L markers, multi-codepoint display modifiers, and similar used in non-English languages such as Persian) will not be. – Yann Ramin Aug 21 '11 at 22:53
@Yann Ramin: That's why the Unicode collation algorithm handles normalization and default ignorables. I often get myself a collator object with the right strength levels set and then call its equality method so I don't have to worry about Unicode's funny ideas of equal inequalities or inequal equalities or such. – tchrist Aug 22 '11 at 03:54

score 3 · Answer 2 · answered Aug 21 '11 at 22:57

Unicode is notoriously difficult to compare.

Note that any Unicode encoding, including UTF-8, 16 or 32 cannot be compared byte-wise for anything other than byte-equality. The display may be identical, but the bytes used (such as R->L markers, surrogate pairs, display modifiers, and similar used in non-English languages such as Persian) will not be.

Generally, you need to normalize Unicode before you can make a realistic comparison if the meaning of the text has any significance:

http://userguide.icu-project.org/transforms/normalization
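To illustrate why normalization matters, here is a small sketch using the Persian letter آ, which can be written either as the single code point U+0622 or as U+0627 followed by the combining mark U+0653. The two spellings are canonically equivalent and render the same, yet their UTF-8 bytes differ. The names below are hypothetical, and the code deliberately does not normalize anything; that step would need a library such as ICU.

```cpp
#include <cassert>
#include <string>

// Precomposed ALEF WITH MADDA ABOVE, U+0622 -> D8 A2 in UTF-8.
const std::string kPrecomposed = "\xD8\xA2";
// ALEF U+0627 (D8 A7) + combining MADDAH ABOVE U+0653 (D9 93).
const std::string kDecomposed = "\xD8\xA7\xD9\x93";

// Plain byte comparison, with no knowledge of Unicode equivalence.
bool same_bytes(const std::string& a, const std::string& b) {
    return a == b;
}

// Self-check: equivalent text, different bytes.
void check_equivalent_spellings_differ() {
    assert(!same_bytes(kPrecomposed, kDecomposed));
}
```

After both strings are converted to the same normalization form (NFC or NFD), a byte comparison of the results would agree.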

_Text_ is notoriously difficult to compare. ASCII cheats by ignoring 95% of all text in the world. — MSalters, Aug 22 '11 at 08:46

score 2 · Accepted Answer · edited May 23 '17 at 12:30

If the strings that you want to compare are in a specific, definite encoding already, then don't use wchar_t and don't use L"" literals -- those are not for Unicode, but for implementation-defined, opaque encodings only.

If your strings are in UTF-8, use a string of chars. If you want to convert them to raw Unicode codepoints (UCS-4/UTF-32), or if you already have them in that form, store them in a string of uint32_ts, or char32_ts if you have a modern compiler.

If you have C++11, your literal can be char str8[] = u8"گل"; or char32_t str32[] = U"گل";. See this topic for some more on this.
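A short sketch of those literals, assuming a C++11 or later compiler. The constant names are hypothetical, and the letters are spelled with universal character names so the result does not depend on the source file's encoding. Note that from C++20 onward a u8 literal has element type char8_t rather than char, so the array type is deduced here instead of being written as char[]:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// UTF-32: one char32_t per code point ("گل" is U+06AF U+0644).
const std::u32string kGol32 = U"\u06AF\u0644";

// UTF-8 code units. The element type is char before C++20 and
// char8_t from C++20, so let the compiler deduce the array type.
const auto& kGol8 = u8"\u06AF\u0644";

// Number of UTF-8 code units, excluding the terminating zero.
constexpr std::size_t kGol8Len = sizeof(kGol8) - 1;

// Self-check: two code points, four UTF-8 bytes.
void check_literals() {
    assert(kGol32.size() == 2);
    assert(kGol8Len == 4);
}
```

The UTF-32 form makes per-code-point comparison straightforward, while the UTF-8 form is what you would typically read from or write to files.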

If you want to interact with command line arguments or the environment, use iconv() to convert from WCHAR to UTF-32 or UTF-8.
