utf-8 word boundary regex in javascript

Question

In JavaScript:

"ab abc cab ab ab".replace(/\bab\b/g, "AB");

correctly gives me:

"AB abc cab AB AB"

When I use utf-8 characters though:

"αβ αβγ γαβ αβ αβ".replace(/\bαβ\b/g, "AB");

the word boundary operator doesn't seem to work:

"αβ αβγ γαβ αβ αβ"

Is there a solution to this?

JavaScript doesn't use `UTF-8` for Unicode. According to the standard an implementation may use either `UCS-2` or `UTF-16` I believe. This means either you are operating on text that has been converted to one of these formats, or you could be operating on text where each "octet" (byte) of each Unicode codepoint has been converted to one of these formats, depending on how your code gets the text. — hippietrail, Nov 19 '12 at 13:04

Gumbo · Accepted Answer · 2010-05-21T11:17:44.633

29

The word boundary assertion \b matches only at a position where a word character is not preceded or not followed by another word character (so .\b. is equivalent to \W\w and \w\W). \w is defined as [A-Za-z0-9_], so it does not match Greek characters, and therefore you cannot use \b for this case.
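To make the explanation above concrete, here is a minimal sketch that checks how \w and \b treat ASCII versus Greek letters. The variable names are illustrative only and are not part of the original answer.

```javascript
// \w is ASCII-only: [A-Za-z0-9_]
const asciiIsWord = /\w/.test("a");
const greekIsWord = /\w/.test("α");

// \b needs a word character on exactly one side of the position.
// Greek letters and spaces are both non-word characters, so no
// boundary exists anywhere in a purely Greek string.
const boundaryInAscii = /\bab\b/.test("ab abc");
const boundaryInGreek = /\bαβ\b/.test("αβ αβγ");

console.log(asciiIsWord, greekIsWord, boundaryInAscii, boundaryInGreek);
// true false true false
```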

What you could do instead is to use this:

"αβ αβγ γαβ αβ αβ".replace(/(^|\s)αβ(?=\s|$)/g, "$1AB")

edited May 21 '10 at 11:17

answered May 21 '10 at 11:06

Gumbo


thanks. The use of the lookahead (?=...) notation looks interesting as well. Could this be done without it? – cherouvim May 22 '10 at 05:09

@cherouvim: No, it would consume the space after the word that is then the start for the next lookup. So just looking at `"αβ αβ"`, the first match would consume `"αβ |αβ"` (`|` indicates the internal pointer) and the last part would not be matched because there is no leading space left. But since the look-ahead assertion does not consume characters, the position of the pointer after the first match will be `"αβ| αβ"` and the leading space is preserved for the next match. – Gumbo May 22 '10 at 06:51
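The difference described in the comment above can be sketched by running both variants side by side. The consuming pattern below is a hypothetical rewrite of the answer's regex without the look-ahead, written only for comparison.

```javascript
const input = "αβ αβγ γαβ αβ αβ";

// Trailing delimiter captured and consumed: the space after one match
// is no longer available as the leading delimiter of the next word.
const consuming = input.replace(/(^|\s)αβ(\s|$)/g, "$1AB$2");

// Trailing delimiter only asserted: nothing after the word is consumed,
// so adjacent words can each match.
const lookahead = input.replace(/(^|\s)αβ(?=\s|$)/g, "$1AB");

console.log(consuming); // "AB αβγ γαβ AB αβ"
console.log(lookahead); // "AB αβγ γαβ AB AB"
```

The consuming version misses the final word because the space in front of it was already used up by the previous match.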

This is not quite the same as a word boundary. It does not match `αβ!` for instance. – R. Martinho Fernandes Apr 13 '11 at 13:45
@R.MartinhoFernandes please try my answer out as I'm needing some more folks to bang on it for my own selfish needs but it turns out it will help you as a side effect. – King Friday Mar 13 '13 at 05:18
See also: http://breakthebit.org/post/3446894238/word-boundaries-in-javascripts-regular-expressions – Noyo Oct 15 '13 at 11:44
This version is good but does not support hyphens. Maybe a unicode range based expression would be optimal, but quite a job ... I hope they reimplement \b to be more useful. Java version has unicode support ... – Eirik Birkeland Jan 06 '17 at 19:11
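For the punctuation cases raised in these comments, one possible approach in newer engines is to build the boundary from Unicode property escapes. This is a sketch, not something proposed in the thread: it assumes an engine with ES2018 support (the `u` flag, `\p{…}` and look-behind), and `replaceWholeWord` is a hypothetical helper name. It treats letters, marks, digits and `_` as word characters, so a hyphen still counts as a boundary.

```javascript
// Hypothetical helper: replace `word` only where it is not adjacent
// to another Unicode letter, combining mark, digit, or underscore.
function replaceWholeWord(text, word, replacement) {
  // Escape regex metacharacters so the word is matched literally.
  const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const wordChar = "[\\p{L}\\p{M}\\p{N}_]";
  const re = new RegExp(`(?<!${wordChar})${escaped}(?!${wordChar})`, "gu");
  // A function replacement avoids special handling of "$" in `replacement`.
  return text.replace(re, () => replacement);
}

const result = replaceWholeWord("αβ αβγ γαβ αβ! (αβ)", "αβ", "AB");
console.log(result); // "AB αβγ γαβ AB! (AB)"
```

Because both assertions are zero-width, nothing around the word is consumed and no capture group has to be re-inserted.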

King Friday · Answer 2 · 2013-03-13T06:46:14.987

I needed something to be programmable and handle punctuation, brackets, etc.

http://jsfiddle.net/AQvyd/

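var wordToReplace = '買い手',
    replacementWord = '[[BUYER]]',
    text = 'Mange 買い手 information. The selected Store and Classification will be the default on the สั่งซื้อ.';

function replaceWord(text, wordToReplace, replacementWord) {
    // Escape regex metacharacters so the word is matched literally.
    var escaped = wordToReplace.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
    // Capture the leading delimiter so it can be put back, and use a look-ahead
    // for the trailing delimiter so it is not consumed by the match.
    var re = new RegExp('(^|[\\s(\'",;])' + escaped + '(?=$|[\\s).\'"!,;?])', 'gi');
    return text.replace(re, function (match, lead) { return lead + replacementWord; });
}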

I've written a JavaScript resource editor, which is how I found this page. I answered out of necessity, since I couldn't find a parameterized word-boundary regexp that worked well for Unicode.

Actually, I should be escaping the "wordToReplace" with "\" in reserved characters. I'll have to update that. — King Friday, Jul 18 '14 at 21:28

Sean Kinsey · Answer 3 · 2010-05-21T11:23:36.370

2

Not all JavaScript regexp implementations support Unicode, so you need to escape the characters:

"αβ αβγ γαβ αβ αβ".replace(/\u03b1\u03b2/g, "AB"); // "AB ABγ γAB AB AB"

For mapping the characters you can take a look at http://htmlhelp.com/reference/html40/entities/symbols.html

Of course, this doesn't help with the word boundary issue (as explained in other answers), but it should at least enable you to match the characters properly.

edited May 21 '10 at 11:23

answered May 21 '10 at 11:18

Sean Kinsey


Then why don’t you use the same Unicode escapes for the string as well? – Gumbo May 21 '10 at 11:26
Because one is parsed as a string, and one as a literal RegExp - I'm not sure if it matters though.. – Sean Kinsey May 21 '10 at 11:34

But if the regular expression implementation does not support Unicode, how is a Unicode escape sequence like `\u03b1` supposed to be interpreted? – Gumbo May 21 '10 at 13:38
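On the question raised in this exchange, a quick check suggests that the escaped and literal forms behave the same in an engine that reads the source file correctly. This is a minimal sketch with illustrative variable names. It assumes the script is decoded as Unicode, in which case the escapes mainly guard against an encoding mismatch in the source file.

```javascript
// "\u03b1\u03b2" and "αβ" denote the same two code units.
const sameString = "αβ" === "\u03b1\u03b2";

const input = "αβ αβγ γαβ αβ αβ";
const viaLiteral = input.replace(/αβ/g, "AB");
const viaEscapes = input.replace(/\u03b1\u03b2/g, "AB");

console.log(sameString);                // true
console.log(viaLiteral === viaEscapes); // true
console.log(viaEscapes);                // "AB ABγ γAB AB AB"
```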

score 0 · Answer 4 · answered May 21 '10 at 11:06

Not all regex implementations in JavaScript engines are Unicode-aware.

For example, Microsoft's JScript, used in IE, is limited to ANSI.

AnthonyWJones


score 0 · Answer 5 · edited May 23 '17 at 12:09

When you’re dealing with Unicode and natural-language words, you probably want to be more careful with boundaries than just using \b. See this answer for details and directions.

Community


answered Nov 18 '10 at 13:40

tchrist

