Thai text is written with no spaces between words. Instead, a space indicates a break, like a comma or the end of a sentence. For example, the string พูดไปสองไพเบี้ย นิ่งเสียตำลึงทอง means "Speech is cheap; silence is golden", with the space acting like my semicolon.

I'm working on an algorithm to detect the word boundaries in Thai text for a Chrome extension. Google Chrome is able to split Thai text along word boundaries at the end of lines. This article indicates that Chrome uses the ICU4C library to achieve this.

Is there a way to access Chrome's Thai word-segmentation feature from JavaScript?
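
One route worth noting: current versions of Chrome (and Node.js) implement the standard `Intl.Segmenter` API, and their `Intl` support is built on the bundled ICU data. The sketch below assumes a runtime that provides `Intl.Segmenter`; `segmentThai` is a hypothetical helper name, and the exact boundaries returned depend on the ICU dictionary shipped with that runtime, so no particular output is promised here.

```javascript
// Sketch only: assumes the runtime implements Intl.Segmenter.
// segmentThai is a hypothetical helper name, not part of any library.
function segmentThai(text) {
  const segmenter = new Intl.Segmenter("th", { granularity: "word" });
  // Each item yielded by segment() has the shape
  // { segment, index, input, isWordLike }.
  return Array.from(segmenter.segment(text), (item) => item.segment);
}

const phrase = "พูดไปสองไพเบี้ย นิ่งเสียตำลึงทอง";
// Prints the pieces separated by " | "; the spaces in the input
// come back as their own (non-word-like) segments.
console.log(segmentThai(phrase).join(" | "));
```

Concatenating the returned pieces always reproduces the input, so the result can be used to insert break opportunities without losing any characters.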

James Newton
  • There are quite a few ready-made JS libraries that do the task ([here is one](http://not.siit.net/members/art/thaiwrap.html)). – Be Brave Be Like Ukraine Aug 15 '16 at 17:24
  • Thanks for the link. However, it states clearly (emphasis in the original) `this bookmarklet ... DOES NOT provide any linguistic correctness (e.g. correct word boundaries)` – James Newton Aug 15 '16 at 18:58
  • That's right. This is why I did not post it as an answer. Simply speaking, this code finds boundaries of *syllables* by using simple rules: "before `เ, แ, ไ, ใ, โ`", "after `ะ`", etc. Line wrap at syllable boundaries is quite acceptable in Thai. Whenever you need boundaries of polysyllabic, *meaningful words*, sooner or later you end up using large dictionaries where these meaningful words (tens of thousands) are listed. – Be Brave Be Like Ukraine Aug 15 '16 at 19:48
  • I'm not sure about access to Google Chrome's library, but this problem is called word segmentation. Since words in Thai are written continuously, there must be an algorithm that assigns word boundaries as you expect. However, achieving high accuracy is quite complex, both because of the language itself and because its users disagree on how words should be segmented. – spicydog Aug 22 '16 at 07:55
  • I am also doing research on this topic, and I know some libraries that you can use, but not in JS. First, LexTo, http://www.sansarn.com/lexto/, which uses longest matching. I have also lazily implemented one in JS at https://github.com/spicydog/thai-word-tokenizer. Do not expect high accuracy from these libraries; they are just okay. If you want to discuss this topic further, you can contact me directly. It's kind of a research topic. – spicydog Aug 22 '16 at 07:55
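
The two approaches mentioned in the comments above can be sketched side by side. This is a rough illustration, not the code of the linked bookmarklet, LexTo, or thai-word-tokenizer: `syllableBreaks` applies only the two syllable rules quoted above (break before `เ, แ, ไ, ใ, โ` and after `ะ`), and `longestMatch` does greedy longest matching against a toy dictionary. Both function names and the five-word dictionary are assumptions made for the example.

```javascript
// Leading vowels เ แ โ ใ ไ (U+0E40..U+0E44) and the final vowel ะ (U+0E30).
const LEADING_VOWELS = "\u0E40\u0E41\u0E42\u0E43\u0E44";
const SARA_A = "\u0E30";

// Hypothetical helper: split at approximate syllable starts using only
// the two rules quoted in the comments. Not linguistically complete.
function syllableBreaks(text) {
  const parts = [];
  let current = "";
  for (const ch of text) {
    // Rule 1: a leading vowel starts a new chunk.
    if (LEADING_VOWELS.includes(ch) && current !== "") {
      parts.push(current);
      current = "";
    }
    current += ch;
    // Rule 2: the vowel ะ ends the current chunk.
    if (ch === SARA_A) {
      parts.push(current);
      current = "";
    }
  }
  if (current !== "") parts.push(current);
  return parts;
}

// Hypothetical helper: greedy longest matching against a word list.
// At each position it takes the longest dictionary entry that fits;
// if none fits, it emits a single UTF-16 code unit and moves on.
function longestMatch(text, dictionary) {
  const words = new Set(dictionary);
  const maxLen = Math.max(0, ...dictionary.map((w) => w.length));
  const tokens = [];
  let i = 0;
  while (i < text.length) {
    let end = i + 1; // fallback when no entry matches
    for (let len = Math.min(maxLen, text.length - i); len > 0; len--) {
      if (words.has(text.slice(i, i + len))) {
        end = i + len;
        break;
      }
    }
    tokens.push(text.slice(i, end));
    i = end;
  }
  return tokens;
}

const sample = "พูดไปสองไพเบี้ย";
const toyDictionary = ["พูด", "ไป", "สอง", "ไพ", "เบี้ย"]; // toy data only
console.log(syllableBreaks(sample).join(" | "));
console.log(longestMatch(sample, toyDictionary).join(" | "));
```

On this sample the syllable rules never split ไปสอง, because สอง does not begin with a leading vowel, while the dictionary pass separates ไป from สอง. That gap is why word-level segmentation ends up needing a large word list.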

0 Answers