219

If you double-click English text in Chrome, the whitespace-delimited word you clicked on is highlighted. This is not surprising. However, the other day I was clicking while reading some text in Japanese and noticed that some words were highlighted at word boundaries, even though Japanese doesn't have spaces. Here's some example text:

どこで生れたかとんと見当がつかぬ。何でも薄暗いじめじめした所でニャーニャー泣いていた事だけは記憶している。

For example, if you click on 薄暗い, Chrome will correctly highlight it as a single word, even though it's not a single character class (this is a mix of kanji and hiragana). Not all the highlights are correct, but they don't seem random.

How does Chrome decide what to highlight here? I tried searching the Chrome source for "japanese word" but only found tests for an experimental module that doesn't seem active in my version of Chrome.

Andreas Rossberg
polm23
  • In the example provided, the boundaries are simply the changes in syllabary (hiragana / kanji / hiragana, etc.) – Strawberry May 09 '20 at 16:18
  • @Strawberry that's not true. 薄 and 暗 are kanji, い is hiragana, and the following letter, じ, is also hiragana. – N. Virgo May 09 '20 at 22:22
  • 1
    @Nathaniel I don't know how it is for you, but when I double-click on the kanji, it only selects the kanji, and when I double-click in the hiragana, it only selects consecutive hiragana, and the same goes for the little bit of katakana (nya nya) – Strawberry May 09 '20 at 22:28
  • 6
    The じめじめした part is a good part to use in testing whether the browser is actually doing intelligent word selection rather than just stopping the selection at kana/kanji/rōmaji boundaries. It’s all hiragana, but Chrome (and Safari) correctly select just the じめじめ part (the した part is a verb inflection). Firefox on the other hand incorrectly selects いじめじめした (because Firefox doesn’t recognize the actual word boundaries at all, but apparently just stops the selection at kana/kanji/rōmaji boundaries). – sideshowbarker May 10 '20 at 02:30
  • 2
    @Strawberry I see. For me it selects the word 薄暗い, as described in the question. (Chrome, Mac.) – N. Virgo May 10 '20 at 08:46
  • Some other test cases: ゆっくりとぼとぼ歩く人 • それはわくわくする夜の行事です • 私をそんなにじろじろ見ないでくれ – sideshowbarker May 19 '20 at 02:39
  • 1
    With one exception, in every single macOS app I’ve tested in — TextEdit, Stickies, Notes, Terminal, etc. — double-click intelligent word selection of Japanese text works as expected. So on macOS at least, Chrome isn’t doing anything special for this that virtually all other macOS apps aren’t also doing — it’s just using the existing ICU-based word-breaking support built into macOS. – sideshowbarker May 19 '20 at 02:52
  • 1
    On macOS, Firefox is the only exception I’ve found to the rule that macOS apps can all do the same kind of double-click intelligent word selection of Japanese text described in this question. Firefox seems to only do the much-simpler thing of just stopping the selection at kana/kanji/rōmaji boundaries. I’ve been told by a Firefox engineer that’s because Firefox doesn’t use the built-in ICU-based macOS platform APIs for text selection. See related bug https://bugzil.la/345823. – sideshowbarker May 19 '20 at 02:58
  • https://icu4c-demos-7hxm2n5zgq-uc.a.run.app/icu-bin/icusegments is an online tool that allows you to input some text (in Japanese or basically any other language) and then shows you how it gets segmented into words/graphemes per ICU Segmentation boundaries. If you try it you’ll find that the word-selection behavior you see in Chrome is consistent with the segmentation behavior you get from that tool — which confirms that the Chrome behavior is just due to it relying on ICU. (And in the case of Chrome on macOS or Windows at least, it’s just relying on the ICU support built into the OS.) – sideshowbarker May 22 '20 at 13:14

3 Answers

169

So it turns out v8 has a non-standard multi-language word segmenter and it handles Japanese.

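function tokenizeJA(text) {
  // Intl.v8BreakIterator is a non-standard, V8-only API backed by ICU.
  var it = new Intl.v8BreakIterator(['ja-JP'], {type: 'word'})
  it.adoptText(text)
  var words = []

  // prev and cur hold the start and end offsets of the current word.
  var cur = 0, prev = 0

  while (cur < text.length) {
    prev = cur
    cur = it.next() // advance to the next word boundary
    words.push(text.substring(prev, cur))
  }

  return words
}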

console.log(tokenizeJA('どこで生れたかとんと見当がつかぬ。何でも薄暗いじめじめした所でニャーニャー泣いていた事だけは記憶している。'))
// ["どこ", "で", "生れ", "たか", "とんと", "見当", "が", "つ", "か", "ぬ", "。", "何でも", "薄暗い", "じめじめ", "した", "所", "で", "ニャーニャー", "泣", "い", "て", "いた事", "だけ", "は", "記憶", "し", "て", "いる", "。"]

I also made a jsfiddle that shows this.

The quality is not amazing but I'm surprised this is supported at all.
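
For comparison, here is a minimal sketch of the same tokenization using the standard `Intl.Segmenter` API rather than the non-standard `Intl.v8BreakIterator`. This is a swapped-in technique, not what the answer above uses. The function name `segmentWords` is made up for illustration, and the exact segments depend on the ICU data shipped with your JavaScript engine, so no expected output is shown.

```javascript
// Sketch: word segmentation with the standard Intl.Segmenter API.
// `segmentWords` is a hypothetical helper name, not part of any library.
function segmentWords(text, locale = 'ja-JP') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'word' });
  // segment() returns an iterable of { segment, index, input, isWordLike }.
  return Array.from(segmenter.segment(text), (s) => s.segment);
}

const sample = '何でも薄暗いじめじめした所でニャーニャー泣いていた事だけは記憶している。';
const words = segmentWords(sample);
console.log(words);
```

Joining the returned pieces always reproduces the input, since the segmenter only inserts boundaries and never drops characters.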

polm23
  • 27
    This is a part of the ICU project: http://userguide.icu-project.org/boundaryanalysis, also see http://www.unicode.org/reports/tr29/#Word_Boundaries – Xorlev May 08 '20 at 19:47
  • 10
    Also see https://source.chromium.org/chromium/chromium/src/+/master:v8/src/objects/js-break-iterator.cc;l=69-93;drc=7cba1ed502e90ece18cccc568eb2fb986b085aa1?originalUrl=https:%2F%2Fcs.chromium.org%2F for where that's wired in. – Xorlev May 08 '20 at 19:53
  • 4
    Windows already has the ability to select the correct word when you double-click on a Japanese word. You don't even need Chrome for this – phuclv May 09 '20 at 05:40
  • 7
    @phuclv: Not everyone who uses Chrome runs it on Windows. – Vikki May 09 '20 at 23:07
  • 2
    Are you sure the v8 behavior has any effect on text selection in the browser UI? Given that v8’s a JavaScript engine, I wouldn’t think that any of the v8 code would be executing while you’re doing text selection in the browser UI. I guess you could check by disabling JavaScript in the browser and then seeing if you observe the same behavior. If you do, then I would think that’d show the behavior isn’t due to v8. (I would do that myself to test it, but as I noted in another comment, in my macOS environment, this already works regardless of which browser I test in — not just in Chrome.) – sideshowbarker May 10 '20 at 02:13
  • By the time this feature gets implemented in Chrome on Android I will probably be in my grave. I've been waiting for this for years, and something tells me I may still be waiting a long time. I don't know why the team hasn't implemented it yet. It must be frustrating for Japanese people as well. – vdegenne Jan 09 '23 at 22:28
95

Based on links posted by JonathonW, the answer basically boils down to: "There's a big list of Japanese words and Chrome checks to see if you double-clicked in a word."

Specifically, v8 uses ICU to do a bunch of Unicode-related text processing things, including breaking text up into words. The ICU boundary-detection code includes a "Dictionary-Based BreakIterator" for languages that don't have spaces, including Japanese, Chinese, Thai, etc.

And for your specific example of "薄暗い", you can find that word in the combined Chinese-Japanese dictionary shipped by ICU (line 255431). There are currently 315,671 total Chinese/Japanese words in the list. Presumably if you find a word that Chrome doesn't split properly, you could send ICU a patch to add that word.
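
As a rough illustration of "find the word that contains the click position", the sketch below uses the standard `Intl.Segmenter` API, which in V8 is backed by ICU. This is not Chrome's actual selection code. The helper name `wordAt` and the sample offset are assumptions made for the example, and which span comes back depends on the ICU dictionary in your runtime.

```javascript
// Hypothetical helper: return the segment containing a given UTF-16 offset,
// roughly the lookup a double-click selection needs.
function wordAt(text, offset, locale = 'ja-JP') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'word' });
  // containing() returns undefined when the offset is out of bounds.
  const hit = segmenter.segment(text).containing(offset);
  if (!hit) return null;
  return {
    start: hit.index,
    end: hit.index + hit.segment.length,
    text: hit.segment,
    isWordLike: hit.isWordLike, // false for punctuation and whitespace
  };
}

const line = '何でも薄暗いじめじめした所でニャーニャー泣いていた。';
// Offset 4 points at 暗, the second character of 薄暗い.
const selection = wordAt(line, 4);
console.log(selection);
```

A selection routine could then highlight the range from `start` to `end`, and perhaps ignore hits where `isWordLike` is false.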

erjiang
  • 7
    [*Windows \[also\] uses a dictionary lookup approach for double-click selection*](https://r12a.github.io/scripts/tutorial/part5) – phuclv May 09 '20 at 08:37
  • ICU and similar projects have been around for a long time. I wouldn’t be surprised if Chrome’s V8 engine picked it up after they transitioned from WebKit, which originated on platforms where the standard text engines have been doing this sort of tokenization for almost 20 years. – rickster May 10 '20 at 07:00
1

The segmentation is still rudimentary (as of 2022-11-27), although Google is progressing quickly in the various fields of language parsing. With the code as it stands today, Google Chrome breaks the sample into |生れ|たか| and |泣|い|て|いた事|. Both 'たか' and 'いた事' are lexically odd, since 'たか' and 'いた' (A) are attached ('agglutinated') to the preceding string 99.9% of the time and (B) carry very little meaning on their own (their usage frequency ranks beyond 10,000th).

For Chinese and Japanese, anyone can get better results with a vocabulary list of just 100,000 items (you add to the list as you read), organized from the longest strings down to the shortest (single characters). For Chinese I set the maximum length at 5 characters, since anything longer is the name of an organization or similar; for Japanese I set the maximum at 9 characters. Tonal languages have (65%) shorter words compared to non-tonal ones.

To parse a paragraph, you run a "do while" loop that starts at the first character and first tries to find the longest possible string in the vocabulary list. If that fails, the search proceeds toward the end of the list, to shorter parts of words with less meaning, until it reaches simple letters or rare single characters (you need all of these single items in the list, for example all 6,000 kanji/hanzi used in daily reading).
You set a separator when you encounter punctuation or numbers, and you skip to the next word.
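
The procedure described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the author's actual code: the function name `longestMatchSegment` and the six-entry vocabulary are made up, and a `Set` lookup over decreasing candidate lengths stands in for scanning a list sorted from longest to shortest. Unlike the description, unknown single characters are emitted as-is instead of requiring every character to be in the list.

```javascript
// Greedy longest-match segmentation against a vocabulary set.
// `vocab` and `maxLen` are illustrative assumptions, not a real dictionary.
function longestMatchSegment(text, vocab, maxLen = 9) {
  const chars = Array.from(text); // split by code point, not UTF-16 unit
  // Punctuation, numbers and whitespace act as separators.
  const isSeparator = (ch) => /[\p{P}\p{N}\s]/u.test(ch);
  const out = [];
  let i = 0;
  while (i < chars.length) {
    if (isSeparator(chars[i])) {
      i += 1; // skip the separator and start a new word
      continue;
    }
    // Try the longest candidate first, then shrink one character at a time.
    let len = Math.min(maxLen, chars.length - i);
    while (len > 1 && !vocab.has(chars.slice(i, i + len).join(''))) {
      len -= 1;
    }
    // len === 1 is the fallback: emit the single character.
    out.push(chars.slice(i, i + len).join(''));
    i += len;
  }
  return out;
}

// Tiny hypothetical vocabulary, just enough for the sample phrase.
const vocab = new Set(['何でも', '薄暗い', 'じめじめ', 'した', '所', 'で']);
const pieces = longestMatchSegment('何でも薄暗いじめじめした所で。', vocab);
console.log(pieces);
// → ["何でも", "薄暗い", "じめじめ", "した", "所", "で"]
```

With a realistic vocabulary, the quality of the result depends almost entirely on which entries the list contains, which matches the point made above about adding words as you read.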

It would be easier if I showed this at work but I don't know if people are interested and if I can post video links here.

Neha Soni