I have a strange problem that I can't explain. I'm trying to manipulate a string containing an accented character such as "é". The string is the name of an image file selected through an input of type file.

What I can't understand is why, when I iterate over the string with a for loop, the accented character is split into two characters. Here is an example:

My é is divided into two characters, like this: e and ́.

"é".length
=> 2

Is it possible that UTF-8 is involved?

I really don't understand this at all!

hypee

2 Answers


They are called Combining Diacritical Marks. They are a "piece" of Unicode: diacritics that can be "chained" onto any base character. The length of the string in that case is 2, because it contains the e followed by the combining acute accent (U+0301). The precomposed characters like àéèìòù have been kept for compatibility, but now any character can be accented :-) Clearly 99% of programmers don't know about this, and 99.9% of programs support it very badly. I'm quite sure they could be used as an attack vector somewhere (but I'm not paranoid :-) )

I'll add that even Jon Skeet, in 2009, wasn't sure how they worked: http://codeblog.jonskeet.uk/2009/11/02/omg-ponies-aka-humanity-epic-fail/
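As a minimal sketch of the difference between the decomposed and precomposed forms, the snippet below uses `String.prototype.normalize` (ES2015, so it assumes a reasonably modern engine) to convert between them. The variable names are only illustrative.

```javascript
// Decomposed form: base letter "e" followed by U+0301 COMBINING ACUTE ACCENT.
const decomposed = "e\u0301";
// Precomposed form: the single code point U+00E9.
const precomposed = "\u00e9";

console.log(decomposed.length);          // 2
console.log(precomposed.length);         // 1
console.log(decomposed === precomposed); // false, although both render as "é"

// NFC composes base + mark into the precomposed character where one exists.
const composed = decomposed.normalize("NFC");
console.log(composed.length);            // 1
console.log(composed === precomposed);   // true

// NFD goes the other way and splits the precomposed character again.
console.log(precomposed.normalize("NFD").length); // 2
```

Normalizing a file name to NFC before measuring or comparing it is one possible way to get the length the question expected, assuming every accented character involved has a precomposed equivalent.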

You see, I couldn't remember whether combining characters came before or after base characters

:-) :-)

schnaader
xanatos

Rather than UTF-8, it's more likely that combining diacritical marks are involved.

>>> "e\u0301"
"é"
>>> "e\u0301".length
2

JavaScript strings are usually encoded as UTF-16, so the precomposed "é" (U+00E9) fits in a single code unit and has a length of 1.
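To check which form a given string actually contains, one option is to dump its UTF-16 code units. This is only a sketch; `codeUnits` is a hypothetical helper name, not a built-in.

```javascript
// Hypothetical helper: list the UTF-16 code units of a string as 4-digit hex.
function codeUnits(str) {
  const units = [];
  for (let i = 0; i < str.length; i++) {
    units.push(str.charCodeAt(i).toString(16).toUpperCase().padStart(4, "0"));
  }
  return units;
}

console.log(codeUnits("\u00e9"));  // [ '00E9' ]          precomposed é
console.log(codeUnits("e\u0301")); // [ '0065', '0301' ]  e + combining acute
```

Running it on the file name from the question should show whether the é arrives as one code unit or as a base letter plus a combining mark.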


But characters outside the BMP (those with a code point beyond U+FFFF) will have a length of 2, as they are encoded as 2 UTF-16 code units (a surrogate pair).

>>> "\uD83D\uDE00".length
2
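The distinction between code units and code points can be sketched as follows. U+1F600 is an arbitrary non-BMP character picked for illustration, and the snippet assumes an ES2015+ engine for `Array.from` and `codePointAt`.

```javascript
// U+1F600 written explicitly as its surrogate pair.
const face = "\uD83D\uDE00";

console.log(face.length);                      // 2 (UTF-16 code units)
console.log(Array.from(face).length);          // 1 (string iteration is by code point)
console.log(face.codePointAt(0).toString(16)); // "1f600"

// A combining sequence is still two code points, so this does not merge it.
console.log(Array.from("e\u0301").length);     // 2
```

Iterating by code point therefore fixes the surrogate-pair case but not the combining-mark case from the question.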
kennytm
  • See also [wikipedia](http://en.wikipedia.org/wiki/Combining_character) about that. – rekire Sep 02 '13 at 17:32
  • To make it clearer for the OP, the second part (the one about the BMP) is orthogonal to the combining diacritical marks. They are independent things. Each Unicode code point is represented by one or two JavaScript characters (UTF-16 code units). On top of this you can "mount" Combining Diacritical Marks (0...n, with n quite big), so that a rendered grapheme could be composed of 1 to x JavaScript characters, with x > 10 :-) Aaah... I wanted to make it clear and it became 5 rows! :-( – xanatos Sep 02 '13 at 17:51
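To count what a user would see as one character (a grapheme), regardless of how many code units or combining marks it contains, a possible approach is `Intl.Segmenter`. This is a sketch that assumes a runtime shipping `Intl.Segmenter` (recent browsers, Node.js 16+); `countGraphemes` is a hypothetical helper name.

```javascript
// Hypothetical helper: count user-perceived characters (grapheme clusters).
function countGraphemes(str) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return Array.from(segmenter.segment(str)).length;
}

// "e" with two stacked marks: combining acute (U+0301) and dot below (U+0323).
const stacked = "e\u0301\u0323";

console.log(stacked.length);                 // 3 code units
console.log(countGraphemes(stacked));        // 1 grapheme
console.log(countGraphemes("\uD83D\uDE00")); // 1 grapheme, 2 code units
```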