BreakIterator API Java

Question

The documentation for BreakIterator.getWordInstance() shows an overload that takes a Locale parameter, presumably because the results of the factory methods (getWordInstance, getLineInstance, getSentenceInstance, getCharacterInstance) can vary between locales.

But when I omit this parameter, I get the same results as when I pass any Locale returned by getAvailableLocales().

Is there some pattern, String, or Locale which actually causes these methods to give different results?

A locale (and therefore a language) is always used with `BreakIterator`: if no locale is specified, the default locale is used. But you will usually get the same results regardless of locale, country, or language. For example, the word boundaries of some Portuguese text will be the same whether your locale is Portugal, Sweden, or Australia, because the word boundary rules are the same. However, I think it may matter when the text is in a language _similar_ to the locale's language, for example when extracting Tamil characters from text while using Telugu as the locale's language. — skomisa, Feb 19 '22 at 19:36
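To check the default-locale point yourself, here is a minimal sketch. The class name `DefaultLocaleBreaks`, the helper `boundaries`, and the Portuguese sample sentence are made up for illustration. It prints the word boundaries from the no-arg factory, from an explicit `Locale.getDefault()`, and from three unrelated locales, so you can compare them on your own JDK.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class DefaultLocaleBreaks {
    // Hypothetical helper: collects every boundary offset the iterator reports.
    static List<Integer> boundaries(BreakIterator it, String text) {
        it.setText(text);
        List<Integer> out = new ArrayList<>();
        for (int b = it.first(); b != BreakIterator.DONE; b = it.next()) {
            out.add(b);
        }
        return out;
    }

    public static void main(String[] args) {
        // Sample sentence chosen for illustration only.
        String text = "O rato roeu a roupa do rei de Roma.";

        // The no-arg factory is documented to use the default locale.
        System.out.println("default locale: " + Locale.getDefault());
        System.out.println("no-arg:   " + boundaries(BreakIterator.getWordInstance(), text));
        System.out.println("explicit: "
                + boundaries(BreakIterator.getWordInstance(Locale.getDefault()), text));

        // Same text, different locales: compare the printed lists yourself.
        for (String tag : new String[] { "pt-PT", "sv-SE", "en-AU" }) {
            Locale loc = Locale.forLanguageTag(tag);
            System.out.println(tag + ": " + boundaries(BreakIterator.getWordInstance(loc), text));
        }
    }
}
```

The same helper works with `getLineInstance`, `getSentenceInstance`, and `getCharacterInstance` if you want to compare those as well.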

score 0 · Answer 1 · answered Sep 07 '16 at 19:19

I believe all "western" languages have the same rules.

A cursory scan shows that locale th (Thai) has its own rules, given in the file /sun/text/resources/th/WordBreakIteratorData_th inside .../jre/lib/ext/localedata.jar.

It's a binary file, so I can't read what it says, and since I don't know Thai, I wouldn't understand the rules even if I could decode the file.
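If you want to see whether the Thai rules make a difference on your setup, here is a rough sketch that runs the same Thai string through a word iterator for English and one for Thai and prints both boundary lists. The class name `ThaiBreakCompare`, the helper, and the sample string are my own choices. I'm not claiming any particular output: it depends on the JDK version and on whether the Thai locale data is available at runtime (on Java 9+ I'd expect that to mean the `jdk.localedata` module, but treat that as an assumption).

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class ThaiBreakCompare {
    // Hypothetical helper: collects every boundary offset the iterator reports.
    static List<Integer> boundaries(BreakIterator it, String text) {
        it.setText(text);
        List<Integer> out = new ArrayList<>();
        for (int b = it.first(); b != BreakIterator.DONE; b = it.next()) {
            out.add(b);
        }
        return out;
    }

    public static void main(String[] args) {
        // Thai for "Thai language", written with Unicode escapes so the
        // source compiles regardless of file encoding. Thai does not put
        // spaces between words, so space-based rules cannot split it.
        String thai = "\u0E20\u0E32\u0E29\u0E32\u0E44\u0E17\u0E22";

        List<Integer> english = boundaries(BreakIterator.getWordInstance(Locale.ENGLISH), thai);
        List<Integer> thaiLoc = boundaries(
                BreakIterator.getWordInstance(Locale.forLanguageTag("th")), thai);

        System.out.println("en: " + english);
        System.out.println("th: " + thaiLoc);
        System.out.println("same boundaries? " + english.equals(thaiLoc));
    }
}
```

If the two lists differ, that would be a concrete case where the Locale parameter changes the result.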
