I'm trying to split a string into emoji and non-emoji chunks. I got a regex from here and altered it as follows to take the text variation selector into account:

(?:(?!(\u00a9|\u00ae|[\u2000-\u3300]|\ud83c[\ud000-\udfff]|\ud83d[\ud000-\udfff]|\ud83e[\ud000-\udfff])+\ufe0e))(\u00a9|\u00ae|[\u2000-\u3300]|\ud83c[\ud000-\udfff]|\ud83d[\ud000-\udfff]|\ud83e[\ud000-\udfff])+

This works with .match, for example:

'\uD83C\uDDE6\uD83C\uDDE8'.match(regex) // (["0x1F1E6", "0x1F1E8"]) => ['\uD83C\uDDE6\uD83C\uDDE8']
'\uD83C\uDDE6\uD83C\uDDE8\uFE0E'.match(regex) // (["0x1F1E6", "0x1F1E8", "0xFE0E"]) => null

But split isn't giving me the expected results:

'\uD83C\uDDE6\uD83C\uDDE8'.split(regex) // (["", undefined, "\uD83C\uDDE8", ""]) => ['\uD83C\uDDE6\uD83C\uDDE8']

I need split to return the entire emoji in one element. What am I doing wrong?

EDIT:

I have a working regex now, except for the edge case exhibited here: https://regex101.com/r/Vki2ZS/2.

I don't want the second emoji to be matched, since it is followed by the text variation selector. I think this happens because I'm using a lookahead: the reversed string is matched as expected. I can't use a negative lookbehind, though, since not all browsers support it.

ragurney
  • Just make all groups non-capturing and, if you need to get the matches too, wrap the whole pattern with a capturing group. – Wiktor Stribiżew Jan 21 '20 at 23:24
  • Doing some more testing, I am thinking it's because of the first non-capturing group, which returns undefined? I found this in the docs: "If separator is a regular expression that contains capturing parentheses, then each time separator is matched, the results (including any undefined results) of the capturing parentheses are spliced into the output array." but how to make the first capture group not return undefined? I'll try out what you suggest Wiktor – ragurney Jan 21 '20 at 23:26
  • No, use `.split(/(?!(?:\u00a9|\u00ae|[\u2000-\u3300]|\ud83c[\ud000-\udfff]|\ud83d[\ud000-\udfff]|\ud83e[\ud000-\udfff])+\ufe0e)(?:\u00a9|\u00ae|[\u2000-\u3300]|\ud83c[\ud000-\udfff]|\ud83d[\ud000-\udfff]|\ud83e[\ud000-\udfff])+/)` – Wiktor Stribiżew Jan 21 '20 at 23:29
  • Or `.split(/((?!(?:\u00a9|\u00ae|[\u2000-\u3300]|\ud83c[\ud000-\udfff]|\ud83d[\ud000-\udfff]|\ud83e[\ud000-\udfff])+\ufe0e)(?:\u00a9|\u00ae|[\u2000-\u3300]|\ud83c[\ud000-\udfff]|\ud83d[\ud000-\udfff]|\ud83e[\ud000-\udfff])+)/)` – Wiktor Stribiżew Jan 21 '20 at 23:29
  • Looks like that second one is the answer. – ragurney Jan 21 '20 at 23:46
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/206468/discussion-between-ragurney-and-wiktor-stribizew). – ragurney Jan 22 '20 at 19:10
  • Try `s.replace(/(?:(?:\u00a9|\u00ae|[\u2000-\u3300]|\ud83c[\ud000-\udfff]|\ud83d[\ud000-\udfff]|\ud83e[\ud000-\udfff])+(?!\ufe0e)(?:\ufe0f)?(?:\u200d)?)+/g, '$&')` – Wiktor Stribiżew Jan 23 '20 at 00:05
  • Sure, thanks again for all of your help. Much appreciated. – ragurney Jan 28 '20 at 18:03

1 Answer

Your pattern does not work because the second emoji is partly matched by the +-quantified group (?:\u00a9|\u00ae|[\u2000-\u3300]|\ud83c[\ud000-\udfff]|\ud83d[\ud000-\udfff]|\ud83e[\ud000-\udfff])+: in \uD83E\uDD20\uFE0F\uD83E\uDD20\uFE0E, it matched \uD83E\uDD20\uFE0F\uD83E\uDD20 in two iterations, first \uD83E\uDD20\uFE0F, then \uD83E\uDD20.

The pattern you may use with .split is

/((?:(?:\u00a9|\u00ae|[\u2000-\u3300]|\ud83c[\ud000-\udfff]|\ud83d[\ud000-\udfff]|\ud83e[\ud000-\udfff])+(?!\ufe0e)(?:\ufe0f)?(?:\u200d)?)+)/

The main goal was to fail every match where (?:\u00a9|\u00ae|[\u2000-\u3300]|\ud83c[\ud000-\udfff]|\ud83d[\ud000-\udfff]|\ud83e[\ud000-\udfff])+ is followed by \uFE0E, so I added a negative lookahead, (?!\ufe0e), right after it.
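To make the effect of the lookahead concrete, here is a minimal sketch that runs the pattern above on a string written with escape sequences, so the variation selectors stay visible. The names `emojiRun`, `emojiStyle`, `textStyle` and `parts` are illustrative only, and the surrounding letters `a` and `b` are an assumed test input, not taken from the question.

```javascript
// Pattern copied from the answer above; the outer capturing group makes
// split() keep each matched emoji run in the result.
const emojiRun = /((?:(?:\u00a9|\u00ae|[\u2000-\u3300]|\ud83c[\ud000-\udfff]|\ud83d[\ud000-\udfff]|\ud83e[\ud000-\udfff])+(?!\ufe0e)(?:\ufe0f)?(?:\u200d)?)+)/;

// U+1F920 written as a surrogate pair, followed by each variation selector.
const emojiStyle = '\uD83E\uDD20\uFE0F'; // emoji presentation selector
const textStyle = '\uD83E\uDD20\uFE0E';  // text presentation selector

const parts = ('a' + emojiStyle + textStyle + 'b').split(emojiRun);

console.log(parts.length);                 // 3
console.log(parts[0]);                     // a
console.log(parts[1] === emojiStyle);      // true: matched as one element
console.log(parts[2] === textStyle + 'b'); // true: the \uFE0E emoji is left unmatched
```

The emoji followed by \uFE0E stays in the non-emoji chunk because the lookahead rejects it, and no later position in that chunk can start a new match.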

JS demo:

var regex = /((?:(?:\u00a9|\u00ae|[\u2000-\u3300]|\ud83c[\ud000-\udfff]|\ud83d[\ud000-\udfff]|\ud83e[\ud000-\udfff])+(?!\ufe0e)(?:\ufe0f)?(?:\u200d)?)+)/;
console.log('\uD83C\uDDE6\uD83C\uDDE8'.split(regex));
console.log('\uD83E\uDD20\uFE0F\uD83E\uDD20\uFE0E'.split(regex));

// If you need to wrap the match with some tags:
console.log('\uD83E\uDD20\uFE0F\uD83E\uDD20\uFE0E'.replace(/(?:(?:\u00a9|\u00ae|[\u2000-\u3300]|\ud83c[\ud000-\udfff]|\ud83d[\ud000-\udfff]|\ud83e[\ud000-\udfff])+(?!\ufe0e)(?:\ufe0f)?(?:\u200d)?)+/g, '<span class="special">$&</span>'));
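As for why only the whole pattern is wrapped in a capturing group: split() splices every capturing group's result into the output, including undefined for groups that did not take part in the match. A small sketch with a deliberately simple digit separator (the input `a1b2c` and the variable names are made up for illustration) shows the three cases:

```javascript
const text = 'a1b2c';

// No capturing group: the separators are dropped.
const noGroup = text.split(/\d/);
console.log(noGroup); // [ 'a', 'b', 'c' ]

// One capturing group around the whole separator: each match is kept.
const wholeGroup = text.split(/(\d)/);
console.log(wholeGroup); // [ 'a', '1', 'b', '2', 'c' ]

// An inner group that never participates still contributes undefined.
const innerGroup = text.split(/(x)?\d/);
console.log(innerGroup); // [ 'a', undefined, 'b', undefined, 'c' ]
```

This is the same effect that produced the stray undefined in the question, where a capturing group sat inside the negative lookahead.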
Wiktor Stribiżew