Strange UTF8 string comparison

Question

I'm having a problem with UTF-8 string comparison that I can't figure out, and it is starting to give me a headache. Please help me out.
Basically, I have this string from an XML document encoded in UTF-8: 'Mina Tidigare anställningar'
When I compare it with exactly the same string, which I typed myself, 'Mina Tidigare anställningar' (also in UTF-8), the result is FALSE!
I have no idea why. It is so strange. Can someone help me out?

Under no circumstances show us any actual code. It would take all the suspense away! And... somebody could accidentally come up with a solution! — Pekka, Sep 03 '10 at 14:09
'Mina Tidigare anställningar' is a special value like NaN, that is not equal to itself. :-p — LarsH, Sep 03 '10 at 14:11
What about, you're comparing apples to bananas? (ASCII / UTF8) — Lekensteyn, Sep 03 '10 at 14:14
When I copied the codes to the browser and then copied the codes from the browser to the editor, the comparison returns TRUE! @-@ I'm gonna bang my head to the wall soon. — James, Sep 03 '10 at 14:14
@James in that case, you are most likely actually working with two different encodings, that get auto-converted when copying them across. — Pekka, Sep 03 '10 at 14:15
@James How do you know your browser sends it to your server in UTF-8? Maybe it does some translation. Or maybe some translation occurs when you read the XML document. So show some code that reproduces the behavior, or at the very least explain where you're getting your string from (database? HTML form? text file?) and where you're getting your XML document from. – nos, Sep 03 '10 at 14:19
@Lekensteyn: You mean "ISO-8859-1 to UTF-8"? ASCII doesn't have a representation of `ä`, IIRC. — Piskvor left the building, Sep 03 '10 at 15:01

score 23 · Accepted Answer · edited May 23 '17 at 12:34

Unicode normalization seems relevant here. To simplify, the same text can be encoded in more than one way in Unicode (and therefore in UTF-8). For example, ř can be written as a single precomposed character (U+0159) or as two characters: r followed by a combining caron (U+030C).

Your best bet would be the Normalizer class (part of PHP's intl extension): normalize both strings to the same normalization form and compare the results.
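
As a minimal sketch of that suggestion, assuming PHP 7 or later (for the "\u{...}" escapes) with the intl extension loaded: the two spellings of ř differ byte for byte, but compare equal once both are normalized to the same form (NFC here). The variable names are illustrative only.

```php
<?php
// Two encodings of "ř": precomposed U+0159 vs. "r" + U+030C COMBINING CARON.
$precomposed = "\u{0159}";   // UTF-8 bytes: c5 99
$decomposed  = "r\u{030C}";  // UTF-8 bytes: 72 cc 8c

// PHP's === compares raw bytes, so the two spellings differ.
var_dump($precomposed === $decomposed);  // bool(false)

// Normalize both sides to the same form (NFC) before comparing.
$a = Normalizer::normalize($precomposed, Normalizer::FORM_C);
$b = Normalizer::normalize($decomposed, Normalizer::FORM_C);
var_dump($a === $b);                     // bool(true)
```

Any normalization form works for this comparison, as long as both strings are put into the same one.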

In one of the comments, you show these hex representations of the strings:

4d696e61205469646967617265 20   616e7374 c3a4   6c6c6e696e676172  // from XML
4d696e61205469646967617265 c2a0 616e7374 61cc88 6c6c6e696e676172 // typed
        ^^-----------------^^^^1         ^^^^^^2

Note the parts I marked: apparently there are two parts to this problem.

For the first, the byte sequence c2a0 is the UTF-8 encoding of U+00A0 NO-BREAK SPACE: for some reason, the space you typed was turned into a non-breaking space, where the XML file has a normal space (20). Note that there's a normal space in both cases after "Mina". I'm not sure what to do about that in PHP, except to replace all whitespace with a normal space.
As to the second, that is the case I outlined above: c3a4 is ä (U+00E4 "LATIN SMALL LETTER A WITH DIAERESIS": one character, two bytes), whereas 61 is a (U+0061 "LATIN SMALL LETTER A": one character, one byte) and cc88 is the combining umlaut (U+0308 "COMBINING DIAERESIS": one character, two bytes), so the typed ä is two characters in three bytes. Here, the normalization library should be useful.
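
Both fixes can be sketched together using the hex dumps quoted in this thread. This assumes PHP 7 or later with the intl extension; `canonicalize` is a hypothetical helper name, not part of PHP or intl, and it only handles U+00A0, not every kind of Unicode whitespace.

```php
<?php
// The two byte strings from the hex dumps quoted in the thread.
$fromXml = hex2bin('4d696e6120546964696761726520616e7374c3a46c6c6e696e676172');
$typed   = hex2bin('4d696e61205469646967617265c2a0616e737461cc886c6c6e696e676172');

// Hypothetical helper (illustration only).
function canonicalize(string $s): string
{
    // 1. Replace U+00A0 NO-BREAK SPACE (bytes c2 a0) with a plain space.
    $s = str_replace("\u{00A0}", ' ', $s);
    // 2. Compose "a" + U+0308 into U+00E4 by normalizing to NFC.
    return Normalizer::normalize($s, Normalizer::FORM_C);
}

var_dump($fromXml === $typed);                              // bool(false)
var_dump(canonicalize($fromXml) === canonicalize($typed));  // bool(true)
```

Applying the helper to both sides, or once when each string is loaded, keeps the comparison itself a plain `===`.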

answered Sep 03 '10 at 14:17

Piskvor left the building

In this case, a Unicode-aware string comparison library should be able to understand that c3a4 == 61cc88. However I doubt it would consider your non-breaking space to be equal to a normal space. Unless you told it to ignore differences between whitespace. You would need to ask your text editor, browser, or wherever you typed the space, why it translated it to nbsp. – LarsH Sep 03 '10 at 14:50
@LarsH: With emphasis on the *should* - PHP internally works with bytes, not characters, so I assume you'd have to do `Normalizer::normalize($string1) == Normalizer::normalize($string2)`, or normalize the strings when you load them. – Piskvor left the building Sep 03 '10 at 14:57
@Piskvor: Right... I wasn't trying to imply that PHP's internal string-comparison routines are Unicode-aware. – LarsH Sep 03 '10 at 15:38
@LarsH: Even worse - most of PHP's internal functions operate on bytes (I could live with that), but some operate on characters, where the charset is apparently influenced by the phase of the moon (it's somewhere deep in php.ini, and I suspect slight bugginess in some cases). If you can help it, don't do anything with strings in PHP beyond concatenation, and even then be careful. – Piskvor left the building Sep 03 '10 at 15:54
@Piskvor That's not accurate. There are some functions which depend on the locale. Unfortunately, the manual sometimes omits this information... – Artefacto Sep 03 '10 at 20:58
Thanks, Piskvor. I have installed the intl extension and used the Normalizer class to solve the problem. :D – James Sep 09 '10 at 13:18

score 2 · Answer 2 · answered Sep 03 '10 at 14:15

Let's try blindly: maybe the two UTF-8 strings do not have the same underlying representation (an accented character can be encoded as a sequence or as a single character). You should give us a hex dump of both UTF-8 strings, and someone may be able to help.
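
One way to produce such a hex dump in PHP is `bin2hex`. The sketch below assumes PHP 7 or later for the "\u{...}" escapes; `hexDump` is a hypothetical helper name that groups the output one byte at a time so differences are easier to spot.

```php
<?php
// Hypothetical helper: hex-dump a string, one byte per group.
function hexDump(string $s): string
{
    return implode(' ', str_split(bin2hex($s), 2));
}

echo hexDump("\u{00E4}"), "\n";   // precomposed ä    -> c3 a4
echo hexDump("a\u{0308}"), "\n";  // a + combining ¨  -> 61 cc 88
```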

kriss

Hej hej kriss, thank you. This is the hex dump of the str from xml file '4d696e6120546964696761726520616e7374c3a46c6c6e696e676172'. And this is of the string I typed myself '4d696e61205469646967617265c2a0616e737461cc886c6c6e696e676172'. – James Sep 03 '10 at 14:19
Obviously they are different... the problem seems to be in the string you typed yourself. In the XML string you have 20 (space), but in your typed string you have c2a0 (whatever that is; I would have to decode it). Either way, they are not the same. – kriss Sep 03 '10 at 15:23

score 0 · Answer 3 · answered Sep 03 '10 at 14:18

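// If $s is not valid UTF-8, assume it is ISO-8859-1 and convert it (the same conversion utf8_encode() performs).
if (!mb_check_encoding($s, 'UTF-8')) { $s = mb_convert_encoding($s, 'UTF-8', 'ISO-8859-1'); }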

DmitryK
