JavaScript regular expression for searching word boundaries in Unicode string

Question

Is there a solution for finding word boundaries in a Japanese string (e.g. "私はマーケットに行きました。") with JavaScript regular expressions? (The "xregexp" JS library can be used.)

E.g.:

var xr = RegExp("\\bst","g");
xr.test("The string") // --> true

I need the same logic for Japanese strings.

A way to match the boundaries between Han, Hiragana, and Katakana would assist but not solve this problem on its own. So far I can't even find a way to match those, even with xregexp. You may be interested in a question I just asked about that: http://stackoverflow.com/questions/16492933/regular-expression-to-match-boundary-between-different-unicode-scripts — hippietrail, May 11 '13 at 01:47
For Japanese it would be better to use a full morphological analyzer. Here's one in JavaScript: https://github.com/takuyaa/kuromoji.js — katspaugh, Oct 01 '15 at 08:56
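
The script-boundary idea from the first comment can be approximated with Unicode property escapes (`\p{Script=...}`, available in ES2018 engines with the `u` flag). This is only a sketch: the helper name `splitByScript` is made up for illustration, and runs of one script are not the same thing as words.

```javascript
// Hypothetical helper: split text into runs of Han, Hiragana, or Katakana.
// U+30FC (ー, the prolonged sound mark) has Script=Common, so it is added
// to the katakana class by hand; punctuation such as 。 is simply skipped.
const scriptRuns =
  /\p{Script=Han}+|\p{Script=Hiragana}+|[\p{Script=Katakana}\u30FC]+/gu;

function splitByScript(text) {
  return text.match(scriptRuns) || [];
}

console.log(splitByScript("私はマーケットに行きました。"));
// → [ '私', 'は', 'マーケット', 'に', '行', 'きました' ]
```

Note that 行きました comes out as two pieces (行 and きました), which is why script boundaries alone do not solve the problem.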

score 6 · Accepted Answer · answered Oct 28 '11 at 11:19

However, the actual problem of separating a Japanese sentence into words is more complicated than it appears, since words are not separated by spaces as they are, for example, in English.

For example, the sentence 私はマーケットに行きました。 ("I went to the market") has the following words:

私 - watakushi
は - wa
マーケット - maaketto
に - ni
行きました - ikimashita
。 - (period)

A reliable parser of Japanese sentences would, among other things, have to find where the particles (wa and ni) lie in the sentence, in order to find the remaining words.
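
Newer JavaScript engines expose a dictionary-backed segmenter as `Intl.Segmenter`, which did not exist when this answer was written. A minimal sketch, assuming an engine that provides `Intl.Segmenter` with Japanese locale data; `segmentWords` is a hypothetical helper name, and the exact splits depend on the engine's data, so no output is shown.

```javascript
// Hypothetical helper: word-level segmentation via Intl.Segmenter.
// Returns null when the engine does not provide Intl.Segmenter.
function segmentWords(text, locale = "ja") {
  if (typeof Intl === "undefined" || typeof Intl.Segmenter !== "function") {
    return null;
  }
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  const words = [];
  for (const { segment, isWordLike } of segmenter.segment(text)) {
    if (isWordLike) words.push(segment); // skip punctuation and whitespace
  }
  return words;
}

console.log(segmentWords("私はマーケットに行きました。"));
```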

answered Oct 28 '11 at 11:19

Peter O.

Yes, this is really hard; you have to have big dictionaries of words, and heuristics for guessing which words are more likely to be meant when a sequence of characters (especially kana) is used. It's possible to make puns where you could read a sentence in more than one way, so ultimately the task is not completely solvable, and there's very little you can do with tools as blunt as regex (never mind JavaScript's Unicode-ignorant regexps). – bobince Oct 28 '11 at 13:40
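
To make the dictionary idea concrete, here is a toy greedy longest-match segmenter. It is a sketch only: the `dictionary` below is hypothetical and covers just the example sentence, and a real analyzer needs a large dictionary plus the disambiguation heuristics described above.

```javascript
// Toy dictionary (hypothetical): only the words of the example sentence.
const dictionary = new Set(["私", "は", "マーケット", "に", "行きました", "。"]);
const maxLen = Math.max(...[...dictionary].map((w) => w.length));

// Greedy longest match: at each position take the longest dictionary entry,
// falling back to a single character when nothing longer matches.
function longestMatchSegment(text) {
  const words = [];
  let i = 0;
  while (i < text.length) {
    let found = text[i];
    for (let len = Math.min(maxLen, text.length - i); len > 1; len--) {
      const candidate = text.slice(i, i + len);
      if (dictionary.has(candidate)) {
        found = candidate;
        break;
      }
    }
    words.push(found);
    i += found.length;
  }
  return words;
}

console.log(longestMatchSegment("私はマーケットに行きました。"));
// → [ '私', 'は', 'マーケット', 'に', '行きました', '。' ]
```

Greedy matching picks one reading and cannot recover from a wrong choice, so ambiguous sentences like the puns mentioned above are out of its reach.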

katspaugh · Answer 2 · 2011-10-28T10:23:54.783

4

\b, as well as \w and \W, aren't Unicode-aware in JavaScript. You have to define your word boundaries as a specific character set, such as (^|$|[\s.,:\u3002]+) or similar.

\u3002 is from ('。'.charCodeAt(0)).toString(16). Is it a punctuation symbol in Japanese?
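
A short check of both points, as a sketch: \b only sees ASCII word characters, and an explicit separator set that includes \u3002 splits at the full stop. The two-sentence sample string and the variable names are made up for illustration.

```javascript
// \b is defined via \w ([A-Za-z0-9_]), so it never fires between kana or kanji.
const asciiHit = /\bst/.test("The string");                  // true
const kanaHit = /\bマ/.test("私はマーケットに行きました。");   // false

// Explicit separator set, including the ideographic full stop U+3002.
const separators = /[\s.,:\u3002]+/;
const pieces = "私はマーケットに行きました。私は帰りました。"
  .split(separators)
  .filter(Boolean); // drop the empty string left after the final 。

console.log(asciiHit, kanaHit, pieces);
// → true false [ '私はマーケットに行きました', '私は帰りました' ]
```

This yields sentence-sized chunks, not words.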

Or, a contrario, define a Unicode range of word-constructing letters and negate it:

var boundaries = /(^|$|\s+|[^\u30A0-\u30FA]+)/g;

The example katakana range is taken from http://www.unicode.org/charts/PDF/U30A0.pdf.
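
A sketch of the range approach on the question's sentence. One assumption is made here: the range is widened to the whole Katakana block (U+30A0 to U+30FF), because the prolonged sound mark ー is U+30FC and falls outside a range that ends at \u30FA. Variable names are illustrative.

```javascript
const text = "私はマーケットに行きました。";

// Whole Katakana block, written with an ASCII hyphen inside the class.
const full = text.match(/[\u30A0-\u30FF]+/g);
// Range ending at \u30FA: ー (U+30FC) is excluded, so the word is cut in two.
const narrow = text.match(/[\u30A0-\u30FA]+/g);
// Same result as `full`, via the negated class used as a separator.
const viaSplit = text.split(/[^\u30A0-\u30FF]+/).filter(Boolean);

console.log(full, narrow, viaSplit);
// → [ 'マーケット' ] [ 'マ', 'ケット' ] [ 'マーケット' ]
```

This isolates katakana runs only; the kanji and hiragana parts of the sentence still need a different treatment.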

edited Oct 28 '11 at 10:23

answered Oct 28 '11 at 10:08

katspaugh


I think yes. '。' is a punctuation symbol – Andrei Oct 28 '11 at 10:21
Yes, it is a full stop, one of the few reliable ways of splitting at word (sentence) boundaries. Doing better than that is very hard (as per Peter's answer). – bobince Oct 28 '11 at 13:36
