
I'm trying to split a sentence into words, but this turns out to be a hard task in JavaScript. I can't simply split on whitespace, because some languages (Thai, Chinese, Japanese, etc.) don't use whitespace to separate words. A dictionary-based algorithm therefore seems like the way to go. However, the dictionaries are large, and I'm trying to do the splitting on the client.

Java has a BreakIterator class that allows you to iterate through the words in the sentence. That's exactly what I need but JS doesn't have the same functionality. Chrome has Intl.v8BreakIterator but I'm looking for a solution for all major browsers.
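For context, here is a rough sketch of how the Chromium-only API mentioned above is typically driven. It assumes `Intl.v8BreakIterator` accepts a `type: 'word'` option and exposes `adoptText`, `first`, and `next` (with `next` returning `-1` at the end), as described in the old v8-i18n wiki. The helper name `splitWithV8BreakIterator` is made up for illustration, and the function returns `null` where the API does not exist.

```javascript
// Sketch only: split text at word boundaries with the non-standard,
// Chromium-only Intl.v8BreakIterator. Returns null if the API is missing.
function splitWithV8BreakIterator(text, locale) {
  if (typeof Intl === 'undefined' || typeof Intl.v8BreakIterator !== 'function') {
    return null; // not a V8-based engine
  }

  const iterator = new Intl.v8BreakIterator([locale], { type: 'word' });
  iterator.adoptText(text);

  const pieces = [];
  let start = iterator.first(); // position of the first boundary (0)
  let end = iterator.next();    // position of the next boundary, or -1 when done
  while (end !== -1) {
    pieces.push(text.slice(start, end));
    start = end;
    end = iterator.next();
  }
  return pieces; // includes whitespace and punctuation pieces, not only words
}

console.log(splitWithV8BreakIterator('Hello world', 'en'));
```

The pieces cover the whole input, so whitespace and punctuation runs appear as their own entries and would still need to be filtered out.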

There is a proposal, Intl.Segmenter, that would solve the issue. It is essentially BreakIterator for JavaScript, but it hasn't been released yet.
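As a minimal sketch of what the proposal describes, word iteration with `Intl.Segmenter` could look like the following, assuming the runtime already implements it. The helper name `splitIntoWords` and the whitespace fallback are illustrative assumptions and not part of the proposal. The fallback only works for languages that separate words with spaces.

```javascript
// Sketch only: return the word-like segments of a sentence.
// Uses Intl.Segmenter when available, otherwise falls back to whitespace.
function splitIntoWords(text, locale) {
  if (typeof Intl !== 'undefined' && typeof Intl.Segmenter === 'function') {
    const segmenter = new Intl.Segmenter(locale, { granularity: 'word' });
    return Array.from(segmenter.segment(text))
      .filter((part) => part.isWordLike) // drop spaces and punctuation
      .map((part) => part.segment);
  }
  // Fallback (assumption): no dictionary, so split on whitespace only.
  return text.split(/\s+/).filter(Boolean);
}

console.log(splitIntoWords('Hello world', 'en')); // [ 'Hello', 'world' ]
```

For Thai, Chinese, or Japanese input the result depends on the dictionary data shipped with the engine, so the exact segments can differ between browsers.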

If there is a way, can you please point me in the right direction?

batatop
  • The point of the proposal is that there's nothing like this and to avoid people having to build their own using large dictionaries. So, until that proposal goes through, you could use the [polyfill provided on the proposal's site](https://gist.github.com/inexorabletash/8c4d869a584bcaa18514729332300356) (which is labeled as not to be used for production) or write your own. – Heretic Monkey Oct 21 '20 at 14:10
  • @HereticMonkey Thank you for your comment. I saw the link that you sent. However, it says: "Uses Intl.v8BreakIterator if present (which in turn uses ICU to do the actual work), otherwise uses a very poorly written, English-only segmenter." and I think Intl.v8BreakIterator is only available on Chromium-based browsers. Do you know if there is a less reliable way than dictionaries, that would get the job done to some extent? :) – batatop Oct 21 '20 at 14:14
  • There are libraries, but as you can see from [this similar question](https://stackoverflow.com/q/23470062/215552), asking for them is off-topic. I did a search on "unicode text segmenter javascript" and found a few... – Heretic Monkey Oct 21 '20 at 14:19
  • @HereticMonkey I think the best way, for now, is using the Polyfill that you mentioned for the languages that don't have whitespaces. And telling the user to use Chrome with those languages. – batatop Oct 21 '20 at 14:30
  • @Seabizkit, I accidentally deleted your comment while I was trying to delete mine. You said [this link](https://stackoverflow.com/questions/45619497/c-sharp-split-a-string-with-mixed-language-into-different-language-chunks) might help me. I checked it, but I think it is for splitting text into language chunks by looking at Unicode values. I'm trying to split sentences word by word, and the whole sentence is going to be in the same Unicode block because all words will (most probably) be in the same language. – batatop Oct 21 '20 at 14:49

1 Answer


It seems you may have to use the spread operator:

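// Spreading a string iterates it by Unicode code point,
// so this yields one array element per character, not per word.
const text = '中國是最古老的文明';
const splitString = [...text];
console.log(splitString);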

Then again, I'm not sure that's what you're after, since I can't read Chinese and don't know whether each character is a word. I read about this approach somewhere a while ago.
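To make clear what the spread operator actually produces, here is a small sketch. The sample strings are illustrative choices: the Thai greeting "สวัสดี" is written with escapes so the code points are unambiguous, and U+20BB7 is a character outside the Basic Multilingual Plane. Spreading splits by code point, which is neither a word boundary nor the same as splitting by UTF-16 code unit.

```javascript
// "สวัสดี" is a single Thai word made of six code points.
const thaiGreeting = '\u0E2A\u0E27\u0E31\u0E2A\u0E14\u0E35';
const thaiCodePoints = [...thaiGreeting];
console.log(thaiCodePoints.length); // 6, even though it is one word

// U+20BB7 needs a surrogate pair in UTF-16.
const rareKanji = '\u{20BB7}';
const codeUnits = rareKanji.split(''); // splits by UTF-16 code unit
const codePoints = [...rareKanji];     // splits by code point
console.log(codeUnits.length);  // 2
console.log(codePoints.length); // 1
```

So the spread operator is a safe way to get characters, but it gives no information about where words begin and end.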

Atlante Avila
  • That separates the characters (code points, technically). See [How do you get a string to a character array in JavaScript?](https://stackoverflow.com/q/4547609/215552), especially [this answer](https://stackoverflow.com/a/33233956/215552) – Heretic Monkey Oct 21 '20 at 14:22
  • Looks like for Chinese each one is a word. – Seabizkit Oct 21 '20 at 14:25
  • @AtlanteAvilia Thank you for your answer. But I don't think that all the languages use a single letter for one word. For example, Thai can have multiple letters for a word and doesn't have whitespaces. – batatop Oct 21 '20 at 14:26
  • you are correct but i don't think there is a way? you would have to code for each language. – Seabizkit Oct 21 '20 at 14:27
  • @Seabizkit Browsers have to do it somehow. I read that Firefox uses the built-in tool in the OS. You can check Comment 4 on [this link](https://bugzilla.mozilla.org/show_bug.cgi?id=1423593). And probably Chrome uses the Intl.v8BreakIterator that they exposed. – batatop Oct 21 '20 at 14:41
  • @batatop What makes you say browsers have to do it? Browsers just "render" markup; they are a rendering program for HTML and can run JavaScript. Interesting: https://code.google.com/archive/p/v8-i18n/wikis/BreakIterator.wiki – Seabizkit Oct 21 '20 at 14:49
  • @Seabizkit For example, when you are typing a sentence in English, if the next word you are typing won't fit on the current line, it will automatically continue from the next line. It can easily do that because the words are separated by whitespaces. I thought it should have the same behaviour in languages that don't use whitespaces too. Please correct me if I'm wrong. – batatop Oct 21 '20 at 14:55