There is a piece of JavaScript code with several regexes using \b to match word boundaries. They don't work as expected on Unicode input. They work more or less like this:
',.,.Michał /#@$^Øystein(*()'.match(/\b.+?\b/g) // => ["Micha", "ł /#@$^Ø", "ystein"]
I would like the expression above to return [ "Michał", " /#@$^", "Øystein" ].
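As far as I can tell, the split happens because \b is defined in terms of \w, and \w only covers the ASCII set [A-Za-z0-9_]. The snippet below is a minimal sketch of that diagnosis; `sample` and `isAsciiWordChar` are names made up for illustration, not part of the real code.

```javascript
// Sample string from the example above.
const sample = ',.,.Michał /#@$^Øystein(*()';

// Hypothetical helper: reports whether a single character matches \w.
// Without special flags, \w is equivalent to [A-Za-z0-9_].
const isAsciiWordChar = (ch) => /\w/.test(ch);

console.log(isAsciiWordChar('a')); // true
console.log(isAsciiWordChar('ł')); // false
console.log(isAsciiWordChar('Ø')); // false

// Because "ł" and "Ø" count as non-word characters, \b sees a boundary
// between "a" and "ł", and another between "Ø" and "y".
const matches = sample.match(/\b.+?\b/g);
console.log(matches); // ["Micha", "ł /#@$^Ø", "ystein"]
```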
The expressions between the \b anchors are actually more complicated than .+?, and some of them are generated, so changing them is quite tricky. Ideally, I would like to keep these expressions unchanged and substitute \b with something that matches zero-width word boundaries in a Unicode-aware way.
Is it possible at all? If it is, how can I do it? If it is not, how can I do it in a way that requires the fewest changes to the expressions between the \b anchors?
I hoped ES6 could help, but it doesn't: the behaviour of \b hasn't been changed there.
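For reference, this is roughly how I checked it, assuming the relevant ES6 feature is the u (Unicode) flag. The variable names are made up for this sketch.

```javascript
// Same sample string as in the example above.
const text = ',.,.Michał /#@$^Øystein(*()';

// Plain regex, as used in the existing code.
const withoutUnicodeFlag = text.match(/\b.+?\b/g);

// Same regex with the ES6 "u" flag added. The flag changes how code points
// are read, but \b is still based on the ASCII-only \w class.
const withUnicodeFlag = text.match(/\b.+?\b/gu);

console.log(withoutUnicodeFlag); // ["Micha", "ł /#@$^Ø", "ystein"]
console.log(withUnicodeFlag);    // ["Micha", "ł /#@$^Ø", "ystein"]
```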