Why is the comparison of two UTF-8 strings not correct?

Question

I have two words, both of type std::string, and both are Unicode words. They are the same, in the sense that when I write them to a file they have the same representation. But when I call word1.compare(word2), I don't get the expected result. Why are they not equal? Should I use a function other than compare to compare two Unicode strings? Thanks.

ifstream myfile;
string term = "";
myfile.open("homograph.txt");
istream_iterator<string> i(myfile);
multiset<string> s(i, istream_iterator<string>());
for (multiset<string>::const_iterator i = s.begin(); i != s.end(); i = s.upper_bound(*i))
{
    term = *i;
}

pugi::xml_document doc;
std::ifstream stream("words0.xml");
pugi::xml_parse_result result = doc.load(stream);
pugi::xml_node words = doc.child("Words");

for (pugi::xml_node_iterator it = words.begin(); it != words.end(); ++it)
{
    std::string wordValue = as_utf8(it->child("WORDVALUE").child_value());
    if (!wordValue.compare(term))
    {
        o << wordValue << endl;
    }
}

The first word is term and the second word is wordValue. The as_utf8() overload is:

std::string wordNet::as_utf8(const char* str)
{
    return str;
}

What do you mean by "representation"? That the same text is printed for both strings? Because this means nothing. `std::string` can have `\0` inside, and if both strings have it and they differ after the `\0`, it's expected that `compare` will report them as different. Show us some code + example (+ file and how you open/read it). — Kiril Kirov, Aug 22 '11 at 10:57
One of the words is a Persian word that I write to a file and then read back using istream_iterator(file). The other string is the return value of pugixml::child_value(), which is basically of type pugi::char_t*, and I then convert it to a string using as_utf8. — aliakbarian, Aug 22 '11 at 11:00
possible duplicate of [how can I compare utf8 string such as persian words in c++?](http://stackoverflow.com/questions/7141417/how-can-i-compare-utf8-string-such-as-persian-words-in-c) — Steve Jessop, Aug 22 '11 at 12:17
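The point about an embedded `\0` raised in the first comment can be illustrated with a small sketch. The helper names `make_with_tail` and `same_as_c_strings` are made up for this illustration and are not part of the question's code; the sketch only shows that two `std::string` values can look identical through a C-style interface and still compare as different.

```cpp
#include <cstring>
#include <string>

// Hypothetical helper: builds "abc", then an embedded NUL, then `tail`.
std::string make_with_tail(const std::string& tail) {
    std::string s = "abc";
    s.push_back('\0');  // std::string may hold NUL bytes in the middle
    s += tail;
    return s;
}

// Hypothetical helper: compares the way C-style code would, that is,
// only up to the first NUL byte of each string.
bool same_as_c_strings(const std::string& a, const std::string& b) {
    return std::strcmp(a.c_str(), b.c_str()) == 0;
}
```

With `a = make_with_tail("x")` and `b = make_with_tail("y")`, `same_as_c_strings(a, b)` is true while `a.compare(b)` is non-zero, because `compare` looks at all five bytes of each string.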

score 4 · Answer 1 · edited Oct 23 '11 at 22:08

In Unicode (and UTF-8 is an encoding of Unicode), there is the problem of composition. A character like é can be represented by its own code point (U+00E9), or by the code point for e (U+0065) followed by a combining acute accent (U+0301). It could be that one string is encoded using precomposition (é) and the other using decomposition (e´). Both will usually be displayed the same way. To avoid the problem, one should normalize both strings to one of these composition forms.

Of course, there could be another problem, but this is one of the problems that can make equal-looking strings not compare as equal. On the other hand, if your text does not have any characters outside ASCII, this is hardly the problem.
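As a minimal sketch of the composition problem, the two spellings of é can be written out as UTF-8 bytes and compared. The names `precomposed`, `decomposed` and `bytes_equal` are chosen for this illustration only; the byte values are the standard UTF-8 encodings of U+00E9 and of U+0065 followed by U+0301.

```cpp
#include <string>

// U+00E9 LATIN SMALL LETTER E WITH ACUTE: two bytes in UTF-8.
const std::string precomposed = "\xC3\xA9";

// U+0065 'e' followed by U+0301 COMBINING ACUTE ACCENT: three bytes in UTF-8.
const std::string decomposed = "e\xCC\x81";

// std::string::compare works on bytes, so it returns 0 only when the
// two byte sequences are identical.
bool bytes_equal(const std::string& a, const std::string& b) {
    return a.compare(b) == 0;
}
```

Here `bytes_equal(precomposed, decomposed)` is false even though both strings would normally be rendered as the same glyph.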

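The correct way to compare such strings is to normalize both to the same form first. The C++ standard library has no normalization facility, so in C++ this requires a Unicode library such as ICU; in Python, the unicodedata module does the same job.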

Unicode Standard Annex #15 (Unicode Normalization Forms) describes composition and normalization in detail.

score 3 · Answer 2 · answered Aug 22 '11 at 11:08

Unicode is more complicated than you think. There are combining characters, invisible code points and whatnot. If two strings look the same when printed, it doesn't mean they are byte-for-byte identical.
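One way to check whether two strings really are identical at the byte level is to dump their bytes in hexadecimal and compare the dumps by eye. The function name `hex_dump` below is an assumption made for this sketch; it uses only the standard library.

```cpp
#include <string>

// Returns the bytes of `s` as space-separated uppercase hex pairs,
// for example "65 CC 81". Invisible code points and combining
// characters become visible in this form.
std::string hex_dump(const std::string& s) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char byte : s) {
        if (!out.empty()) {
            out += ' ';
        }
        out += digits[byte >> 4];    // high nibble
        out += digits[byte & 0x0F];  // low nibble
    }
    return out;
}
```

In the question's code, writing `hex_dump(term)` and `hex_dump(wordValue)` to the output would show where the two byte sequences differ, if they do.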

To take all complications of Unicode into account, you need to use a Unicode-aware string library. One such library is ICU. The C++ standard library is most definitely not Unicode-aware: std::string treats UTF-8 text as a plain sequence of bytes, so size() counts bytes rather than characters and compare() compares byte by byte.
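To illustrate how little `std::string` knows about UTF-8, the sketch below counts code points by hand. The function name `count_code_points` is hypothetical, and the code assumes its input is well-formed UTF-8; it does no validation and counts code points, not user-perceived characters.

```cpp
#include <cstddef>
#include <string>

// Counts Unicode code points in a well-formed UTF-8 string by counting
// every byte that is not a continuation byte (bit pattern 10xxxxxx).
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For a four-letter Persian word such as سلام, `size()` reports 8 because each letter takes two bytes in UTF-8, while `count_code_points` reports 4.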

n. m. could be an AI


When I put the words in a text file and save the file as UTF-8, I have no problem. I think the problem is the function as_utf8(); I don't really know what it returns. – aliakbarian Aug 22 '11 at 11:20
That's right, `as_utf8()` will not magically recode whatever encoding you have at hand to UTF-8. If you want to recode strings to UTF-8, you need to use a library that can do the recoding. – n. m. could be an AI Aug 22 '11 at 11:45

score -4 · Answer 3 · answered Aug 22 '11 at 11:10

Try using std::wstring instead.

weekens


The result of the as_utf8() function is a string, and I can't use the function as_wide() because it takes a different argument. – aliakbarian Aug 22 '11 at 11:14
mbstowcs() function for native chars is probably what you need. – weekens Aug 22 '11 at 11:20
wstring does not magically solve anything. Besides having to convert the encoding to UTF-16 or UTF-32, the caveats for composite glyphs still exist. – PlasmaHH Aug 22 '11 at 11:20
My problem is that I can't convert std::string to std::wstring. – aliakbarian Aug 22 '11 at 11:31
@aliakbarian, `std::wstring(mbstowcs(std::string.c_str()))` (well, not exactly, but just to show an idea) – weekens Aug 22 '11 at 12:21
If the strings don't compare equal as UTF-8 on the byte level, they won't be binary equal as UTF-16 or UTF-32, either, since the translation between those is unambiguous. – Christopher Creutzig Aug 22 '11 at 12:38
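The last comment can be checked with a small decoder. This is a sketch only: `decode_utf8` is a hypothetical name, and the code assumes well-formed UTF-8 (no validation of stray or missing continuation bytes, overlong forms, or surrogates). It shows that the precomposed and decomposed spellings of é still differ after conversion to UTF-32.

```cpp
#include <cstddef>
#include <string>

// Decodes well-formed UTF-8 into a sequence of UTF-32 code points.
std::u32string decode_utf8(const std::string& utf8) {
    std::u32string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        unsigned char lead = static_cast<unsigned char>(utf8[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes that follow
        if (lead < 0x80)      { cp = lead;        extra = 0; }
        else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }
        else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }
        else                  { cp = lead & 0x07; extra = 3; }
        ++i;
        // Each continuation byte contributes its low six bits.
        for (; extra > 0 && i < utf8.size(); --extra, ++i) {
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i]) & 0x3F);
        }
        out.push_back(cp);
    }
    return out;
}
```

Decoding `"\xC3\xA9"` gives the single code point U+00E9, while decoding `"e\xCC\x81"` gives U+0065 followed by U+0301, so the two strings remain unequal in UTF-32 as well.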
