17

I want to compare two strings in JavaScript that look the same, and yet the equality operator == returns false. One string contains a special character (e.g. the Danish å).

JavaScript code:

var filenameFromJS = "Designhåndbog.pdf";
var filenameFromServer = "Designhåndbog.pdf";

console.log(filenameFromJS == filenameFromServer); // This prints false. Why?

The solution: what worked for me is Unicode normalization, as slevithan pointed out.

I forked my original jsfiddle to make a version using the normalization lib suggested by slevithan. Link: http://jsfiddle.net/GWZ8j/1/.

Eric Leschinski
tougher
  • See this article about `==` vs. `===` http://stackoverflow.com/questions/359494/javascript-vs-does-it-matter-which-equal-operator-i-use – Steve May 29 '12 at 19:53
  • @Steve When both operands are of the same type, it does not matter if you use loose or strict comparison. – PointedEars May 29 '12 at 20:01
  • This is also very useful: https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/ (What every developer needs to know about unicode and character sets) – GrahamMc Mar 29 '18 at 10:48

5 Answers

14

Unlike what some other people here have said, this has nothing to do with encodings. Rather, your two strings use different code points to render the same visual characters.

To solve this correctly, you need to perform Unicode normalization on the two strings before comparing them. Unfortunately, JavaScript doesn't have this functionality built in. Here is a JavaScript library that can perform the normalization for you: https://github.com/walling/unorm
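As a later comment in this thread notes, current JavaScript engines ship a native `String.prototype.normalize()`, so in those environments the comparison can be sketched without a library. The two string literals and the helper name `equalsNormalized` below are assumptions for illustration; the escapes spell out the two forms of å that the question's strings most likely contain.

```javascript
// Assumed inputs: the same visible filename spelled two different ways.
const precomposed = "Designh\u00E5ndbog.pdf"; // å as U+00E5
const decomposed = "Designha\u030Andbog.pdf"; // a + U+030A COMBINING RING ABOVE

// Hypothetical helper: bring both strings to NFC before comparing.
function equalsNormalized(a, b) {
  return a.normalize("NFC") === b.normalize("NFC");
}

console.log(precomposed === decomposed);                // false
console.log(equalsNormalized(precomposed, decomposed)); // true
```

NFC is used here only because it yields the shorter, precomposed form; NFD would work equally well as long as both sides use the same form.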

slevithan
  • Oh, I was hoping not to get this answer :-) I hoped I was just missing the obvious and wouldn't need a library for this simple task. Thanks for the answer; I'll give it a try. – tougher May 29 '12 at 20:21
  • You are right, I missed that `CC 8A` is the UTF-8 code sequence for `U+030A COMBINING RING ABOVE`, which is preceded by `a`. The other string has `C3 A5`, which encodes `U+00E5 LATIN SMALL LETTER A WITH RING ABOVE` in UTF-8. IIRC, Mac OS prefers the combining characters, while other OSes prefer the single-glyph form. It should be possible to have the server convert either one, though, so no large client-side library is necessary. – PointedEars May 29 '12 at 21:47
  • PointedEars, that's not necessarily possible or ideal. E.g., you might not want to do a server round trip just to perform a string comparison, or you might be using JavaScript on the server. @Tougher, there is a proposal to add Unicode normalization to future versions of JavaScript. See [strawman:unicode_normalization](http://wiki.ecmascript.org/doku.php?id=strawman:unicode_normalization). – slevithan May 30 '12 at 03:56
  • There is now a [String#normalize()](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/normalize) method natively available in JS. – Kaiido Apr 20 '22 at 09:14
6

The JavaScript equality operator == will appear to fail under the following circumstances. In all cases it is programmer error, not a bug in JavaScript.

  1. The two strings do not contain the same number and sequence of characters.

  2. There is whitespace or a newline before, within or after one string. Call trim() on both and look closely at both strings.

  3. Surprise typecasting. The programmer is comparing datatypes that are incompatible.

  4. There are Unicode characters which look identical to other Unicode characters but are in fact different Unicode characters.
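A minimal sketch of how cases 2 to 4 can show up and be inspected. The sample strings and the helper name `codePoints` are made up for illustration and are not taken from the question.

```javascript
// Hypothetical helper: list the code points of a string so that
// invisible or lookalike characters become visible.
function codePoints(s) {
  return Array.from(s, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  ).join(" ");
}

// Case 2: stray whitespace.
const padded = "abc\n";
console.log(padded == "abc");        // false
console.log(padded.trim() == "abc"); // true

// Case 3: surprise typecasting. Two String objects are compared by identity.
console.log(new String("abc") == new String("abc")); // false

// Case 4: lookalike characters.
const latinA = "A";          // U+0041 LATIN CAPITAL LETTER A
const greekAlpha = "\u0391"; // U+0391 GREEK CAPITAL LETTER ALPHA
console.log(latinA == greekAlpha);                       // false
console.log(codePoints(latinA), codePoints(greekAlpha)); // U+0041 U+0391
```

Dumping the code points of both strings side by side is usually the quickest way to see which of the four cases applies.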

Eric Leschinski
  • +1, because this answer is way more informative than the accepted one and doesn't contain something with nodeJS or jQuery. – unexist Feb 21 '14 at 16:13
  • In this case, number 4 was the culprit. – vahanpwns Aug 21 '15 at 18:56
  • Different Unicode normalisation is not about different characters; it means different Unicode code point sequences were used to refer to the same character. – James Jan 02 '18 at 22:14
0

UTF-8 is a complex thing. The charset has two different codes for characters such as á, é, etc. As you can already see in the URL-encoded version, the hex bytes that make up the character differ between the two versions.

See this answer for more information.
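The difference in the URL-encoded bytes can be reproduced with `encodeURIComponent`. The two spellings of å below are an assumption, chosen to match the byte values quoted elsewhere in this thread (`C3 A5` and `CC 8A`); the variable names are illustrative.

```javascript
// Assumed inputs: two ways of writing the same visible character.
const singleCodePoint = "\u00E5"; // å as one code point
const withCombining = "a\u030A";  // a followed by COMBINING RING ABOVE

// encodeURIComponent percent-encodes the UTF-8 bytes of each string.
console.log(encodeURIComponent(singleCodePoint)); // %C3%A5
console.log(encodeURIComponent(withCombining));   // a%CC%8A
```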

Community
user2428118
  • JFTR: Unicode is _not_ UTF-8. Unicode is a standard for a character set and several encodings; UTF-8 is one of those encodings. – PointedEars May 29 '12 at 20:02
  • Now you are saying that UTF-8 was a character set. It is not. I am also rather certain that your premise is false: a UTF-8 code sequence may not begin with 0xCC. – PointedEars May 29 '12 at 20:12
  • You're right, I should have called it "encoding", as it appears (http://www.w3.org/TR/html4/charset.html). The HTML code is `<meta charset="UTF-8">` (HTML5) or `<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">` however, so that's somewhat misleading. – user2428118 May 29 '12 at 20:24
  • Yes, I guess we will have to live with that mistake from the early Internet drafts (I'm talking RFC 822 and friends here) for a long time to come. – PointedEars May 29 '12 at 21:14
  • I was wrong about 0xCC. [Richard Ishida's excellent Unicode tools](http://www.rishida.net/tools/conversion/) proved it. – PointedEars May 29 '12 at 21:49
0

I had this same problem.

Adding

<meta charset="UTF-8">

to the HTML file fixed the issue.

In my case the templating engine was baking a JSON string into the HTML file. This string was in Unicode.

While the template was also a Unicode file, the JS engine was treating the string I wrote into the template as a Latin-1 encoded string until I added the meta tag.

I was comparing the typed-in string to one of the JSON object's items (`location.title == "Mühle"`).
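A rough sketch of the effect described here, assuming the page's UTF-8 bytes were decoded as Latin-1. It uses the `TextEncoder` and `TextDecoder` globals available in browsers and recent Node.js versions; the variable names are illustrative, and the setup only approximates what the browser did in this answer's case.

```javascript
// The intended string, written with an escape so the example does not
// depend on the encoding of this file.
const intended = "M\u00FChle"; // "Mühle"

// Encode as UTF-8, then (wrongly) decode the same bytes as Latin-1.
const utf8Bytes = new TextEncoder().encode(intended);
const misread = new TextDecoder("latin1").decode(utf8Bytes);

console.log(misread);             // MÃ¼hle
console.log(misread == intended); // false
```

Once the document declares its encoding, the bytes are decoded as UTF-8 and the comparison sees the intended string.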

Daniel F
0

Let the browser normalize Unicode for you. This approach worked for me:

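// Requires jQuery ($). Round-trips the string through a hidden element:
// the browser parses it as HTML and serializes it back to a string.
// Because .html(s) parses s as markup, only pass trusted strings.
function normalizeUnicode(s) {
    let div = $('<div style="display: none"></div>').html(s).appendTo('body');
    let res = div.html();
    div.remove();
    return res;
}

normalizeUnicode(unicodeVal1) == normalizeUnicode(unicodeVal2);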