I'm trying to match some text against a query that the user inputs. After running into some issues, I found this rather odd behaviour of String.indexOf that I simply cannot understand:

If I try to match a query without diacritics against a string with diacritics, it works (not sure why):

"brezzel cu brânză".indexOf("bra")

11

But matching the same query with one more letter appended doesn't work:

"brezzel cu brânză".indexOf("bran")

-1

(tested both in Chrome & Firefox, same behaviour)

Is this a documented behaviour that I'm unaware of or what exactly is happening here?

iuliu.net
  • `a` is not equal to `â`.. `brân` is in the string but `bran` is not in the string – The Bomb Squad Nov 08 '20 at 22:48
  • in case these chars look the same to you(your display is STRANGE), run some code in a js console.. `console.log("a"=="â")` – The Bomb Squad Nov 08 '20 at 22:51
  • `Array.from("brânză")` reveals what exactly is in the string. – Sebastian Simon Nov 08 '20 at 22:52
  • That "a" character in your source string is comprised of the normal latin "a" plus the "combining circumflex accent" character, Unicode code point 770 (decimal). – Pointy Nov 08 '20 at 22:54
  • Related: [Javascript - normalize accented greek characters](https://stackoverflow.com/q/23346506/4642212) and [Seemingly identical strings fail comparison](https://stackoverflow.com/q/16799810/4642212). – Sebastian Simon Nov 08 '20 at 23:00

1 Answer

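JavaScript strings are sequences of 16-bit (2-byte) UTF-16 code units. The â in your string is not a single code unit: it is stored as two, a plain a followed by the combining circumflex accent U+0302. Searching for "bra" matches that plain a, which is why the first case works. Searching for "bran" fails because the code unit after the a is U+0302, not n. Use the escape function to see: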

escape("brezzel cu brânză")
"brezzel%20cu%20bra%u0302nza%u0306"

Note that %20 is a space, followed by bra, and then %u0302, which together with the preceding a renders as â.
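Since escape is a legacy function, here is a minimal sketch of the same inspection without it. It assumes the string is in the decomposed form shown above and spells the combining marks as explicit \u escapes so the result does not depend on how the text was copied; the variable names are only illustrative.

```javascript
// Decomposed form: "a" + U+0302 (combining circumflex), "a" + U+0306 (combining breve).
const text = "brezzel cu bra\u0302nza\u0306";

// List every code point of the last word, which starts at index 11.
const units = Array.from(text.slice(11), (ch) =>
  "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
);
console.log(units.join(" "));
// U+0062 U+0072 U+0061 U+0302 U+006E U+007A U+0061 U+0306

const idxBra = text.indexOf("bra");   // matches: b, r, plain a
const idxBran = text.indexOf("bran"); // fails: after the plain a comes U+0302, not n
console.log(idxBra, idxBran); // 11 -1
```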

You can probably work out the rest. Test it yourself if you want to:

'a' + String.fromCharCode(0x0302) //â
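If the goal is for the search to behave consistently, one possible approach (not the only one) is to normalize both the text and the query with String.prototype.normalize before calling indexOf. The sketch below assumes the decomposed input from the question; stripDiacritics is a hypothetical helper name, and the range U+0300 to U+036F covers only the basic combining diacritical marks block.

```javascript
// Decomposed input, written with explicit escapes.
const decomposed = "brezzel cu bra\u0302nza\u0306";

// NFC composes "a" + U+0302 into the single code point U+00E2 (â).
const composed = decomposed.normalize("NFC");
const idxComposed = composed.indexOf("br\u00e2n"); // 11
const idxPlainInComposed = composed.indexOf("bra"); // -1: â is no longer "a" + mark

// Hypothetical helper: decompose with NFD, then drop the combining marks,
// so that accented and unaccented spellings compare equal.
function stripDiacritics(s) {
  return s.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
}

const idxStripped = stripDiacritics(decomposed).indexOf(stripDiacritics("bran")); // 11
console.log(idxComposed, idxPlainInComposed, idxStripped); // 11 -1 11
```

Keep in mind that an index found in the stripped string refers to the stripped string, so it will not always line up with positions in the original text.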
ibrahim tanyalcin