I have a piece of JavaScript code with several regexes that use \b to match word boundaries. They don't work as expected on Unicode input, because \b treats only ASCII letters, digits and the underscore as word characters. Simplified, they behave like this:

',.,.Michał /#@$^Øystein(*()'.match(/\b.+?\b/g) // => ["Micha", "ł /#@$^Ø", "ystein"]

I would like the expression above to return [ "Michał", " /#@$^", "Øystein" ].

The expressions between the \b anchors are actually more complicated than .+?, and some of them are generated, so changing them is quite tricky. Ideally, I would like to keep these expressions unchanged and replace \b with something that matches zero-width word boundaries in a Unicode-aware way.

Is it possible at all? If it is, how can I do it? If it is not, how can I do it in a way that requires the fewest changes to the expressions between the \b anchors?
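For illustration, a substitution of this kind might look like the sketch below. It rests on several assumptions: the engine supports lookbehind and Unicode property escapes (both ES2018, so later than ES6), a "word character" means any Unicode letter or number plus the underscore, and adding the `u` flag does not break the existing expressions. The names `WORD`, `UNICODE_BOUNDARY` and `withUnicodeBoundaries` are hypothetical helpers, not part of any library.

```javascript
// Assumption: "word character" = Unicode letter, Unicode number, or underscore.
// Combining marks (\p{M}) are not included, so decomposed input may still split.
const WORD = '[\\p{L}\\p{N}_]';

// Zero-width boundary: a word character on exactly one side of the position.
const UNICODE_BOUNDARY =
  `(?:(?<=${WORD})(?!${WORD})|(?<!${WORD})(?=${WORD}))`;

// Hypothetical helper: rebuilds a regex with every \b replaced by the
// lookaround-based boundary, leaving the rest of the source untouched.
function withUnicodeBoundaries(regex) {
  // Walk over escape sequences pairwise so that "\\b" (an escaped backslash
  // followed by a literal b) is not mistaken for a word boundary.
  // Known limitation: [\b] (backspace inside a character class) would also
  // be rewritten, which is wrong; this sketch does not handle it.
  const source = regex.source.replace(/\\([\s\S])/g, (escape, ch) =>
    ch === 'b' ? UNICODE_BOUNDARY : escape);

  // \p{...} needs Unicode mode; keep an existing u or v flag as is.
  const flags = /[uv]/.test(regex.flags) ? regex.flags : regex.flags + 'u';
  return new RegExp(source, flags);
}

const input = ',.,.Michał /#@$^Øystein(*()';
const result = input.match(withUnicodeBoundaries(/\b.+?\b/g));
console.log(result); // [ 'Michał', ' /#@$^', 'Øystein' ]
```

The rewrite works on `regex.source`, so the expressions between the boundaries stay as they are. Whether generated expressions survive the switch to Unicode mode is an open question, since the `u` flag rejects some escapes that are tolerated without it.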

I hoped ES6 could help, but it won't – the behaviour of \b hasn't been changed there.
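A quick check of that claim, assuming the ES6 feature in question is the `u` flag: it changes how `.` and code points are handled, but \b still uses the ASCII definition of a word character, so the result is the same with and without it.

```javascript
// Same input and pattern as in the example above.
const sample = ',.,.Michał /#@$^Øystein(*()';

const withoutUnicodeFlag = sample.match(/\b.+?\b/g);
const withUnicodeFlag = sample.match(/\b.+?\b/gu);

console.log(withoutUnicodeFlag); // [ 'Micha', 'ł /#@$^Ø', 'ystein' ]
console.log(withUnicodeFlag);    // [ 'Micha', 'ł /#@$^Ø', 'ystein' ]
```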

kxmh42