What would be a correct approach to creating a function that counts words in more languages than str_word_count()? Specifically, I need to support Chinese, Japanese, and Korean.
I'm thinking it would be something like this:
- Somehow check if less than 50% of the characters are multibyte. If true, use str_word_count() and return.
- Remove each contiguous run of alphanumeric characters and add 1 to the count for each run (assume these are Western words).
- Remove all spaces and punctuation. Add the length of the remaining string (in characters, not bytes) to the count.
- Return count.
Are there better approaches? I can think of some flaws off the top of my head: accented characters, multibyte languages that use spaces to delimit words (e.g. Arabic, I believe).