4

Is there a way to find word boundaries in a Japanese string (e.g. "私はマーケットに行きました。") with JavaScript regular expressions? (The "xregexp" JS library can be used.)

E.g.:

var xr = RegExp("\\bst","g");
xr.test("The string") // --> true

I need the same logic for Japanese strings.

hippietrail
Andrei
  • I don't understand, what is `\\bst`? – hippietrail May 11 '13 at 01:43
  • A way to match the boundaries between Han, Hiragana, and Katakana would assist but not solve this problem on its own. So far I can't even find a way to match those, even with xregexp. You may be interested in a question I just asked about that: http://stackoverflow.com/questions/16492933/regular-expression-to-match-boundary-between-different-unicode-scripts – hippietrail May 11 '13 at 01:47
  • For Japanese it would be better to use a full morphological analyzer. Here's one in JavaScript: https://github.com/takuyaa/kuromoji.js – katspaugh Oct 01 '15 at 08:56

2 Answers

6

The actual problem of separating a Japanese sentence into words is more complicated than it appears, since words are not separated by spaces as they are in, for example, English.

For example, the sentence 私はマーケットに行きました。 ("I went to the market") has the following words:

  • 私 - watakushi
  • は - wa
  • マーケット - maaketto
  • に - ni
  • 行きました - ikimashita
  • 。 - (period)

A reliable parser of Japanese sentences would, among other things, have to find where the particles (wa and ni) lie in the sentence, in order to find the remaining words.
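As a rough illustration of dictionary-based splitting, here is a minimal sketch that assumes a newer JavaScript runtime providing Intl.Segmenter (it is not available in older engines). The helper name segmentWords is hypothetical, and the exact segments returned depend on the runtime's word-break data, so they may differ from the list above.

```javascript
// Sketch only: assumes the runtime provides Intl.Segmenter.
// segmentWords is a hypothetical helper name, not part of any library.
function segmentWords(text, locale) {
  var segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  var words = [];
  // Each item describes one segment: its text, its start index,
  // and whether the runtime considers it word-like (not punctuation/space).
  for (var part of segmenter.segment(text)) {
    words.push({ text: part.segment, index: part.index, isWordLike: part.isWordLike });
  }
  return words;
}

var sentence = "私はマーケットに行きました。";
var parts = segmentWords(sentence, "ja");
console.log(parts.map(function (p) { return p.text; }).join(" | "));
```

The start index of each segment is effectively a word boundary, which is the information \b cannot provide for Japanese text.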

Peter O.
  • 3
    Yes, this is really hard; you have to have big dictionaries of words, and heuristics for guessing which words are more likely to be meant when a given sequence of characters (especially kana) is used. It's possible to make puns where you could read a sentence in more than one way, so ultimately the task is not completely solvable, and there's very little you can do with tools as blunt as regex (never mind JavaScript's Unicode-ignorant regexps). – bobince Oct 28 '11 at 13:40
4

\b, as well as \w and \W, aren't Unicode-aware in JavaScript. You have to define your word boundaries as a specific character set, like (^|$|[\s.,:\u3002]+) or similar.

\u3002 is from ('。'.charCodeAt(0)).toString(16). Is it a punctuation symbol in Japanese?
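A minimal sketch of that idea, mimicking the \bst test from the question with an explicit boundary set. The helper names (escapeRegExp, startsWordWith) and the extra \u3001 (、) in the set are assumptions for illustration. Note that it only detects boundaries at punctuation and whitespace, not between adjacent Japanese words.

```javascript
// Hand-written stand-in for \b: start of string, or one of the listed
// separators. \u3002 is 。 and \u3001 is 、 (the latter is an added assumption).
var BOUNDARY = "(?:^|[\\s.,:\\u3002\\u3001])";

// Escape regex metacharacters so the prefix is matched literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// True if `prefix` occurs right after a boundary somewhere in `text`.
function startsWordWith(text, prefix) {
  return new RegExp(BOUNDARY + escapeRegExp(prefix)).test(text);
}

console.log(startsWordWith("The string", "st"));               // true
console.log(startsWordWith("The best", "st"));                 // false
console.log(startsWordWith("こんにちは。それは本です。", "それ")); // true (follows 。)
console.log(startsWordWith("こんにちは。それは本です。", "本"));   // false (follows は, no separator)
```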

Or, a contrario, define a Unicode range of word-constructing letters and negate it:

var boundaries = /(^|$|\s+|[^\u30A0-\u30FA]+)/g;

The example katakana range taken from http://www.unicode.org/charts/PDF/U30A0.pdf.
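A small sketch of the negated-range approach, splitting on anything that is not katakana. As an assumption, the range here is widened to \u30FF (the end of the Katakana block) so that the long-vowel mark ー (U+30FC) in マーケット counts as part of the word; katakanaRuns is a hypothetical helper name.

```javascript
// Runs of non-katakana characters act as separators.
// Assumption: upper bound widened to \u30FF so ー (U+30FC) is included.
var NON_KATAKANA = /[^\u30A0-\u30FF]+/;

// Return the katakana-only runs found in the text.
function katakanaRuns(text) {
  return text.split(NON_KATAKANA).filter(function (s) {
    return s.length > 0; // drop empty strings at the edges
  });
}

var runs = katakanaRuns("私はマーケットに行きました。");
console.log(runs); // [ 'マーケット' ]
```

This only isolates katakana words; kanji and hiragana runs such as 私は are still lumped together as separators, so it is a partial answer at best.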

katspaugh
  • I think yes. '。' is a punctuation symbol – Andrei Oct 28 '11 at 10:21
  • 1
    Yes, it is a full stop, one of the few reliable ways of splitting at word (sentence) boundaries. Doing better than that is very hard (as per Peter's answer). – bobince Oct 28 '11 at 13:36