This code appears to work with "normal" characters, but not with those outside the Basic Multilingual Plane.

Why does this not work, and is there a way to make it work?

let s = "⛵️"
let unicodeArray = [...s]

console.log(unicodeArray.slice(1, 2)) // ["⛵"] // correct
console.log(unicodeArray.slice(1, 3)) // ["⛵", "️"] // incorrect
Ben Aston
  • JavaScript represents Unicode with UTF-16, and most of the string operations don't understand the implications. – Pointy Feb 07 '20 at 15:28
  • Fine. But I am deliberately using the spread syntax to create the array in a BMP-aware fashion. Where is the breakage? – Ben Aston Feb 07 '20 at 15:34
  • The "empty" character between the second and "third" symbol is the problem – Andreas Feb 07 '20 at 15:40

2 Answers

The problem is that in your string, the ⛵️ is two separate codepoints: the sailboat emoji (U+26F5) and a variation selector (U+FE0F). Your unicodeArray has a length of 4, leading to more substrings.
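
To see the split for yourself, you can list the code points that the string iterator yields. This is only a minimal sketch: `codePoints` is a hypothetical helper name, and the sailboat is written with escapes so that the otherwise invisible selector is explicit.

```javascript
// Hypothetical helper: list the code points the string iterator yields.
function codePoints(str) {
  return Array.from(str, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

// Sailboat (U+26F5) followed by variation selector-16 (U+FE0F).
const boat = "\u26F5\uFE0F";

console.log(codePoints(boat)); // ["U+26F5", "U+FE0F"]
console.log([...boat].length); // 2
```

Spreading or `Array.from` therefore gives the selector its own array slot, even though it renders as part of the sailboat.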

If you omit the variation selector, it works as expected:

const s1 = "abc"
const s2 = "⛵️" // length 6
const s3 = "⛵" // length 5
console.log(s2 === s3) // false

function substrings(s) {
    const unicodeArray = Array.from(s)
    const result = []

    for (let l = 1; l <= unicodeArray.length; l++) {
      for (let i = 0; i <= unicodeArray.length - l; i++) {
        result.push(unicodeArray.slice(i, i + l).join(''))
      }
    }
    return result
}

console.log(substrings(s1)) // ["a", "b", "c", "ab", "bc", "abc"]
console.log(substrings(s2)) // ["", "⛵", "️", "", "⛵", "⛵️", "️", "⛵️", "⛵️", "⛵️"]
console.log(substrings(s3)) // ["", "⛵", "", "⛵", "⛵️", "⛵️"]
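
If the goal is to keep the sailboat and its variation selector together, one possible approach is to split on grapheme clusters instead of code points. The sketch below assumes a runtime that provides `Intl.Segmenter`; `graphemes`, `graphemeSubstrings`, and the `sample` string are hypothetical names and a stand-in input, not the strings used above.

```javascript
// Split into user-perceived characters (grapheme clusters) rather than code points.
// Assumes the runtime provides Intl.Segmenter.
function graphemes(str) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return Array.from(segmenter.segment(str), (part) => part.segment);
}

// Same nested loops as substrings(), but slicing over grapheme clusters.
function graphemeSubstrings(str) {
  const units = graphemes(str);
  const result = [];
  for (let len = 1; len <= units.length; len++) {
    for (let i = 0; i <= units.length - len; i++) {
      result.push(units.slice(i, i + len).join(""));
    }
  }
  return result;
}

// "a", sailboat + variation selector, "b" (stand-in input)
const sample = "a\u26F5\uFE0Fb";
console.log([...sample].length);                // 4 code points
console.log(graphemes(sample).length);          // 3 grapheme clusters
console.log(graphemeSubstrings(sample).length); // 6
```

With this split, the selector never appears as a standalone "substring", because it stays attached to the character it modifies.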
Bergi
  • But surely the "sailboat character" consists of both the emoji and the variation selector, and both should therefore be assigned to index two in the array. Or no? – Ben Aston Feb 07 '20 at 15:47
  • See [What characters are grouped with Array.from?](https://stackoverflow.com/q/60053160/1048572) and the related questions. The string iterator only splits codepoints, while you seem to be looking for [grapheme clusters](https://mathiasbynens.be/notes/javascript-unicode#other-grapheme-clusters). – Bergi Feb 07 '20 at 15:57

Because the length of those characters confuses your function:

console.log("⛵️".length); // 6
console.log("abc".length);     // 3
TKoL
  • Why though? I am splitting the string into characters in a BMP compliant manner (`[...s]`). Nowhere am I referring to the length of the string. Deliberately. – Ben Aston Feb 07 '20 at 15:31
  • You literally referred to the length before you edited your post. I can see the edit history... – TKoL Feb 07 '20 at 15:45
  • Yes... the array that you made out of the string... `[...s]` – TKoL Feb 07 '20 at 16:05
  • Lose the snark. That string is not the string under question anyway, so... I don't even understand how the goalposts got moved to there anyway. – TKoL Feb 07 '20 at 16:21
  • I'm glad you can now see the error you made. Good conversation – TKoL Feb 07 '20 at 16:24
  • `[...'⛵️'].length === 4` and `'⛵️'.length === 6`. Note the array and spread syntax totally changes the result. Never in my question (check the history), did I directly use the length property on a string. If I had, then I might have accepted your answer as correct. – Ben Aston Feb 07 '20 at 16:42