What would be a correct approach to creating a function that counts words in more languages than str_word_count()? Specifically, I need to support Chinese, Japanese, and Korean.

I think it would be something like this:

  1. Somehow check if less than 50% of the characters are multibyte. If true, use str_word_count() and return.
  2. Remove every run of consecutive alphanumeric characters, adding 1 to the count for each run (assume these are Western words).
  3. Remove all spaces and punctuation. Add the length of the remaining string to the count.
  4. Return count.
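
As a rough sketch only, the four steps above might look like this in PHP. The function name `heuristic_word_count` is hypothetical, "alphanumeric" is assumed to mean ASCII letters and digits, and the code assumes UTF-8 input with the mbstring extension available. Counting every remaining character as one word is the assumption from step 3, not real word segmentation, so it will overcount scripts such as Korean that do use spaces.

```php
<?php
// Hypothetical helper illustrating the outline above; not a tested library function.
// Assumes UTF-8 input, the mbstring extension, and PCRE with Unicode support.
function heuristic_word_count($text)
{
    $total = mb_strlen($text, 'UTF-8');
    if ($total === 0) {
        return 0;
    }

    // Step 1: count non-ASCII characters (these are multibyte in UTF-8).
    $multibyte = preg_match_all('/[^\x00-\x7F]/u', $text);
    if ($multibyte * 2 < $total) {
        return str_word_count($text);
    }

    // Step 2: each run of ASCII letters/digits counts as one word, then is removed.
    $count = preg_match_all('/[A-Za-z0-9]+/', $text);
    $rest = preg_replace('/[A-Za-z0-9]+/', '', $text);

    // Step 3: drop whitespace and punctuation; every remaining character counts as a word.
    $rest = preg_replace('/[\s\p{P}]+/u', '', $rest);

    // Step 4: return the combined count.
    return $count + mb_strlen($rest, 'UTF-8');
}

echo heuristic_word_count('你好世界 PHP'), "\n"; // 5: four Han characters plus "PHP"
```

The flaws mentioned below still apply: accented Latin letters are not matched by `[A-Za-z0-9]`, so in a mostly multibyte string each accented character would be counted as its own word.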

Are there better approaches? I can think of some flaws off the top of my head: accented characters, multibyte languages that use spaces to delimit words (e.g. Arabic, I believe).

Leo Jiang
  • Probably related questions: http://stackoverflow.com/questions/8290537/is-php-str-word-count-multibyte-safe and http://stackoverflow.com/questions/11084623/creating-an-effective-word-counter-including-chinese-japanese-and-other-accented – feeela Mar 17 '14 at 19:46
  • To count words, you need to know what a word is. `str_word_count` works on the assumption that words are delimited by some space character, which is not the case for many Asian languages. Your algorithm outline looks fine, but I would bet that there are existing solutions out there. – feeela Mar 17 '14 at 19:49
  • @feeela I couldn't find any existing solutions, and neither of those questions is of much help. – Leo Jiang Mar 17 '14 at 23:35

1 Answer

What about using ICU, which PHP exposes through the intl extension (class IntlBreakIterator)?

Something like this:

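function utf8_word_count($string, $mode = 0) {
    // Reuse a single word-break iterator across calls.
    static $it = NULL;

    if (is_null($it)) {
        $it = IntlBreakIterator::createWordInstance(ini_get('intl.default_locale'));
    }

    $l = 0; // offset of the previous boundary
    $it->setText($string);
    // $mode == 0 returns the number of words; any other value returns the words themselves.
    $ret = $mode == 0 ? 0 : array();
    if (IntlBreakIterator::DONE != ($u = $it->first())) {
        do {
            // WORD_NONE marks segments such as spaces and punctuation; skip those.
            if (IntlBreakIterator::WORD_NONE != $it->getRuleStatus()) {
                if ($mode == 0) {
                    ++$ret;
                } else {
                    $ret[] = substr($string, $l, $u - $l);
                }
            }
            $l = $u;
        } while (IntlBreakIterator::DONE != ($u = $it->next()));
    }

    return $ret;
}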

(Requires the intl extension to be enabled and PHP >= 5.5.0.)
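
A shorter, count-only variant can iterate with `foreach` over the iterator's parts instead of driving `first()`/`next()` by hand. This is only a sketch: the function name `utf8_word_count_foreach` and the `'en_US'` default locale are assumptions for illustration, and it relies on `getRuleStatus()` describing the boundary that ends the current part.

```php
<?php
// Hypothetical count-only variant; requires the intl extension (PHP >= 5.5).
function utf8_word_count_foreach($string, $locale = 'en_US')
{
    $it = IntlBreakIterator::createWordInstance($locale);
    $it->setText($string);

    $count = 0;
    // getPartsIterator() yields each segment between two consecutive boundaries.
    foreach ($it->getPartsIterator() as $part) {
        // Skip segments classified as non-words (spaces, punctuation).
        if (IntlBreakIterator::WORD_NONE != $it->getRuleStatus()) {
            ++$count;
        }
    }

    return $count;
}
```

To collect the words instead of counting them, append `$part` to an array inside the same `if`.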

julp
  • I've been using this with pretty good results. A few notes that others may find interesting: (1) To get this working with Chinese, I had to upgrade Intl and ICU (I didn't experiment to find the minimum versions). (2) Passing the locale argument seems to make no difference, and scripts can be mixed, it seems (I would like to be proven wrong). (3) I found the simpler `foreach` iteration was faster than the example above. – Tim Sep 16 '16 at 12:16