Unicode-aware equivalent of \b in JavaScript regexes

Asked Jan 10 '17 at 22:24

Active Jan 10 '17 at 22:31

I have a piece of JavaScript code with several regexes that use \b to match word boundaries. They don't work as expected on Unicode input, because \b treats only ASCII letters, digits, and the underscore as word characters. They behave more or less like this:

',.,.Michał /#@$^Øystein(*()'.match(/\b.+?\b/g) // => ["Micha", "ł /#@$^Ø", "ystein"]

I would like the expression above to return [ "Michał", " /#@$^", "Øystein" ].
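To illustrate where the unwanted split comes from, here is a minimal check, assuming a pattern without the i flag. The variable name asciiWord is only for illustration.

```javascript
// \w is the ASCII class [A-Za-z0-9_], so non-ASCII letters
// land on the "non-word" side of every \b.
const asciiWord = /\w/;
console.log(asciiWord.test('a')); // true
console.log(asciiWord.test('ł')); // false
console.log(asciiWord.test('Ø')); // false

// \b therefore fires between "a" and "ł", cutting the name in two.
console.log('Michał'.split(/\b/)); // ["Micha", "ł"]
```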

The expressions between the \b anchors are actually more complicated than .+? and some of them are generated, so changing them is quite tricky. Ideally, I would like to keep these expressions unchanged and substitute \b with something that matches zero-width word boundaries in a Unicode-aware way.

Is it possible at all? If it is, how can I do it? If it is not, how can I do it in a way that requires the fewest changes to the expressions between the \b anchors?
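One possible shape of such a substitution, as a rough sketch rather than a definitive answer: it assumes an engine with ES2018 lookbehind assertions and Unicode property escapes, neither of which was available when this was asked. The names WORD, UNICODE_BOUNDARY, and withUnicodeBoundaries are hypothetical, and treating letters, marks, numbers, and the underscore as word characters is an assumption that may not suit every script.

```javascript
// Assumed definition of a "word" character: any Unicode letter, mark,
// number, or underscore. Adjust to taste.
const WORD = '[\\p{L}\\p{M}\\p{N}_]';

// Zero-width assertion: a word character on exactly one side.
const UNICODE_BOUNDARY =
  `(?:(?<=${WORD})(?!${WORD})|(?<!${WORD})(?=${WORD}))`;

// Hypothetical helper: rebuilds a regex, swapping every \b found outside
// a character class for UNICODE_BOUNDARY. Inside a class, \b means
// backspace and is left alone. \B is not handled.
function withUnicodeBoundaries(regex) {
  const src = regex.source;
  let out = '';
  let inClass = false;
  for (let i = 0; i < src.length; i++) {
    const ch = src[i];
    if (ch === '\\') {
      const next = src[i + 1];
      out += (next === 'b' && !inClass) ? UNICODE_BOUNDARY : ch + next;
      i++; // skip the escaped character
      continue;
    }
    if (ch === '[') inClass = true;
    else if (ch === ']') inClass = false;
    out += ch;
  }
  // \p{...} requires the u flag, so it is added when missing.
  const flags = regex.flags.includes('u') ? regex.flags : regex.flags + 'u';
  return new RegExp(out, flags);
}

const input = ',.,.Michał /#@$^Øystein(*()';
console.log(input.match(withUnicodeBoundaries(/\b.+?\b/g)));
// ["Michał", " /#@$^", "Øystein"]
```

The rest of the pattern stays untouched, but adding the u flag makes the parser stricter, so a generated expression that relies on sloppy escapes could throw when rebuilt this way.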

I hoped ES6 could help, but it won't – the behaviour of \b hasn't been changed there.
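A quick way to confirm this is to run the same expression with and without the ES6 u flag. The variable names are only for illustration.

```javascript
const input = ',.,.Michał /#@$^Øystein(*()';

// Same pattern, once without and once with the ES6 u flag.
const withoutU = input.match(/\b.+?\b/g);
const withU = input.match(/\b.+?\b/gu);

console.log(withoutU); // ["Micha", "ł /#@$^Ø", "ystein"]
console.log(withU);    // ["Micha", "ł /#@$^Ø", "ystein"]
```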

edited Jan 10 '17 at 22:31

kxmh42

0 Answers