Is it possible to get word boundaries of SE Asian scripts via JavaScript?

Question

My goal is to break SE Asian texts into words, preferably from within a browser. While this is trivial for Western languages, using a regex or simply splitting on spaces, it's a much tougher problem for scripts that don't put spaces between words. For example, find the word boundaries in this Thai line:

เขาสามารถทำในสิ่งที่ต้องการต่อไปได้

Modern browsers do detect the word boundaries, however. This can be observed by double-clicking on the text above. Only the word within the line gets highlighted, not the entire block. From my research so far, this word boundary determination is done by native libraries on each platform. Is it possible to get these word break boundaries via JavaScript?

I don't know. I will just warn you that in JavaScript, a character isn't a character, which will need to be handled in whatever solution you come up with. It's a UTF-16 code unit, and so a single character can take up two places in a JavaScript string because it's represented with a [surrogate pair](http://www.unicode.org/faq/utf_bom.html#utf16-1). Also note that JavaScript strings tolerate invalid pairs. — T.J. Crowder, Sep 20 '15 at 10:34
No, the native algorithm of the browser word highlighting is not accessible to JavaScript. However it is possible to reproduce it. — Bergi, Sep 20 '15 at 12:35
Thank you, Bergi. If you want to make that an answer, I'll choose it. — Mark Wilbur, Sep 20 '15 at 17:18
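
For readers arriving later: the thread above does not mention it, but newer JavaScript engines expose `Intl.Segmenter`, which performs locale-aware word segmentation. The sketch below assumes a runtime that ships `Intl.Segmenter` with dictionary data for Thai; the helper name `wordBoundaries` is made up for illustration, and the exact boundaries may differ between engines and from what double-clicking highlights. It also illustrates the point about UTF-16 code units raised in the first comment, since the reported offsets count code units rather than characters.

```javascript
// Sketch only: assumes Intl.Segmenter is available in this runtime.
// wordBoundaries is a hypothetical helper name, not part of any library.
function wordBoundaries(text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'word' });
  // Each segment reports its text, its start offset in UTF-16 code units,
  // and isWordLike (false for whitespace and punctuation).
  return Array.from(
    segmenter.segment(text),
    ({ segment, index, isWordLike }) => ({
      segment,
      start: index,
      end: index + segment.length,
      isWordLike,
    })
  );
}

const thai = 'เขาสามารถทำในสิ่งที่ต้องการต่อไปได้';
for (const { segment, start, end } of wordBoundaries(thai, 'th')) {
  console.log(start, end, segment);
}

// Offsets are UTF-16 code units, so a character outside the BMP
// occupies two positions in the string.
const emoji = '😀';
console.log(emoji.length);             // 2 (a surrogate pair)
console.log(Array.from(emoji).length); // 1 (iteration is by code point)
```

Because `start` and `end` are code-unit offsets, they can be passed directly to `text.slice(start, end)`. Filtering on `isWordLike` drops spaces and punctuation when only the words are wanted.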


0 Answers