JavaScript string comparison fails when comparing Unicode characters

Question

I want to compare two strings in JavaScript that look the same, and yet the equality operator == returns false. One string contains a special character (e.g. the Danish å).

JavaScript code:

var filenameFromJS = "Designhåndbog.pdf";
var filenameFromServer = "Designhåndbog.pdf";

print(filenameFromJS == filenameFromServer); // This prints false. Why?

The solution: what worked for me is Unicode normalization, as slevithan pointed out.

I forked my original jsfiddle to make a version using the normalization lib suggested by slevithan. Link: http://jsfiddle.net/GWZ8j/1/.

See this article about `==` vs. `===` http://stackoverflow.com/questions/359494/javascript-vs-does-it-matter-which-equal-operator-i-use — Steve, May 29 '12 at 19:53
@Steve When both operands are of the same type, it does not matter if you use loose or strict comparison. — PointedEars, May 29 '12 at 20:01
This is also very useful: https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/ (What every developer needs to know about unicode and character sets) — GrahamMc, Mar 29 '18 at 10:48

slevithan · Accepted Answer · 2012-05-29T21:52:41.230

14

Unlike what some other people here have said, this has nothing to do with encodings. Rather, your two strings use different code points to render the same visual characters.

To solve this correctly, you need to perform Unicode normalization on the two strings before comparing them. Unfortunately, JavaScript doesn't have this functionality built in. Here is a JavaScript library that can perform the normalization for you: https://github.com/walling/unorm
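
Since this answer was written, ECMAScript 2015 added `String.prototype.normalize()`, so current browsers and Node.js can normalize without a library. A minimal sketch follows; the variable names are only for illustration, and escape sequences are used so the two forms of "å" stay unambiguous:

```javascript
// "å" as one precomposed code point (U+00E5).
const precomposed = "Designh\u00E5ndbog.pdf";
// "å" as "a" followed by U+030A COMBINING RING ABOVE.
const decomposed = "Designha\u030Andbog.pdf";

// The raw strings differ, even though they render identically.
const rawEqual = precomposed === decomposed;

// Normalizing both sides to the same form (NFC here) makes them comparable.
const normalizedEqual =
  precomposed.normalize("NFC") === decomposed.normalize("NFC");

console.log(rawEqual, normalizedEqual); // false true
```

Either NFC or NFD works for an equality check, as long as both strings are normalized to the same form.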

edited May 29 '12 at 21:52

answered May 29 '12 at 20:03

slevithan


Oh, I was hoping not to get this answer :-) That I was just missing the obvious and wouldn't need a library for this simple task. Thanks for the answer I'll give it a try. – tougher May 29 '12 at 20:21
You are right, I missed that `CC 8A` is the UTF-8 code sequence for `U+030A COMBINING RING ABOVE`, which is preceded by `a`. The other string has `C3 A5`, which encodes `U+00E5 LATIN SMALL LETTER A WITH RING ABOVE` in UTF-8. IIRC, Mac OS prefers the combining characters, while other OSes prefer the single-glyph form. It should be possible to have the server convert either one, though, so no large client-side library is necessary. – PointedEars May 29 '12 at 21:47
PointedEars, that's not necessarily possible or ideal. E.g., you might not want to do a server round trip just to perform a string comparison, or you might be using JavaScript on the server. @Tougher, there is a proposal to add Unicode normalization to future versions of JavaScript. See [strawman:unicode_normalization](http://wiki.ecmascript.org/doku.php?id=strawman:unicode_normalization). – slevithan May 30 '12 at 03:56

There is now a [String#normalize()](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/normalize) method natively available in JS. – Kaiido Apr 20 '22 at 09:14
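
The byte sequences PointedEars mentions in the comments above can be checked with a short sketch. It assumes an environment that provides the standard `TextEncoder` (current browsers and Node.js); `utf8Hex` is a hypothetical helper name, not part of any library:

```javascript
// Hex-dump the UTF-8 bytes of a string (TextEncoder always encodes as UTF-8).
function utf8Hex(s) {
  return Array.from(new TextEncoder().encode(s), (b) =>
    b.toString(16).toUpperCase().padStart(2, "0")
  ).join(" ");
}

console.log(utf8Hex("\u00E5"));  // C3 A5     (precomposed å)
console.log(utf8Hex("a\u030A")); // 61 CC 8A  ("a" + combining ring above)
```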

score 6 · Answer 2 · answered Oct 29 '13 at 03:17


The JavaScript equality operator == will appear to fail under the following circumstances. In every case it is programmer error, not a bug in JavaScript.

1. The two strings do not contain the same number and sequence of characters.
2. There is whitespace or a newline before, within, or after one string. Call trim() on both and look closely at both strings.
3. Surprise type coercion. The programmer is comparing data types that are incompatible.
4. There are Unicode characters which look identical to other Unicode characters but are in fact different Unicode characters.
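
As a rough illustration, the causes listed above could be checked one by one. Both function names below are hypothetical helpers written for this sketch, and the order of the checks is only one reasonable choice:

```javascript
// Hypothetical helper: reports the first likely reason two values fail ==.
function diagnoseMismatch(a, b) {
  if (a == b) return "equal";
  if (typeof a !== typeof b) return "different types";
  if (typeof a !== "string") return "different values";
  if (a.trim() === b.trim()) return "whitespace differs";
  if (a.normalize("NFC") === b.normalize("NFC")) {
    return "Unicode normalization differs";
  }
  return "different characters";
}

// List the code points of a string, e.g. to spot look-alike characters.
function codePoints(s) {
  return Array.from(s, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

console.log(diagnoseMismatch("abc ", "abc"));      // whitespace differs
console.log(diagnoseMismatch("\u00E5", "a\u030A")); // Unicode normalization differs
console.log(codePoints("a\u030A").join(" "));      // U+0061 U+030A
```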


Eric Leschinski


+1, because this answer is way more informative than the accepted one and doesn't contain something with nodeJS or jQuery. – unexist Feb 21 '14 at 16:13
in this case number 4 was the culprit – vahanpwns Aug 21 '15 at 18:56
Different unicode normalisation is not about different characters, but means different unicode code point sequences were used to refer to the same character. – James Jan 02 '18 at 22:14

score 0 · Answer 3 · edited May 23 '17 at 11:46


UTF-8 is a complex thing. The charset has two different codes for characters such as á, é etc. As you already see in the URL encoded version, the HEX bytes of which the character is made differ for both versions.

See this answer for more information.


answered May 29 '12 at 19:54

user2428118


JFTR: Unicode is _not_ UTF-8. Unicode is a standard for a character set and several encodings; UTF-8 is one of those encodings. – PointedEars May 29 '12 at 20:02
Now you are saying that UTF-8 was a character set. It is not. I am also rather certain that your premise is false: a UTF-8 code sequence may not begin with 0xCC. – PointedEars May 29 '12 at 20:12
You're right, I should have called it "encoding", as it appears (http://www.w3.org/TR/html4/charset.html). The HTML code is `` (HTML5) or `` however, so that's somewhat misleading. – user2428118 May 29 '12 at 20:24
Yes, I guess we will have to live with that mistake from the early Internet drafts (I'm talking RFC 822 and friends here) for a long time to come. – PointedEars May 29 '12 at 21:14
I was wrong about 0xCC. [Richard Ishida's excellent Unicode tools](http://www.rishida.net/tools/conversion/) proved it. – PointedEars May 29 '12 at 21:49

score 0 · Answer 4 · answered Aug 06 '17 at 21:12

I had this same problem.

Adding

<meta charset="UTF-8">

to the HTML file fixed the issue.

In my case the templating engine was baking a JSON string into the HTML file. This string was in Unicode.

While the template was also a Unicode file, the JS engine was treating the string I wrote into the template as a Latin-1 encoded string until I added the meta tag.

I was comparing the typed-in string to one of the JSON object's items (location.title == "Mühle").

score 0 · Answer 5 · answered Aug 11 '21 at 14:53

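Let the browser normalize Unicode for you. This approach worked for me:

// Requires jQuery ($). Note that .html(s) parses s as HTML, so only pass trusted input.
function normalizeUnicode(s) {
    // Round-trip the string through a hidden, temporary DOM element.
    const div = $('<div style="display: none"></div>').html(s).appendTo('body');
    const res = div.html();
    div.remove();
    return res;
}

normalizeUnicode(unicodeVal1) == normalizeUnicode(unicodeVal2)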
