13

I want to use str_word_count() on a UTF-8 string.

Is this safe in PHP? It seems to me that it should be (especially considering that there is no mb_str_word_count()).

But on php.net there are a lot of people muddying the water by presenting their own 'multibyte compatible' versions of the function.

So I guess I want to know...

  1. Given that str_word_count simply counts all character sequences delimited by " " (space), it should be safe on multibyte strings, even though it's not necessarily aware of the character sequences, right?

  2. Are there any equivalent 'space' characters in UTF-8 which are not ASCII " " (space)?

This is where the problem might lie, I guess.

hakre
carpii
  • The user notes on the manual page for the function have some custom implementations for a UTF-8 version, so I guess the built-in one doesn't play nice with it: http://www.php.net/manual/en/function.str-word-count.php – BoltClock Nov 28 '11 at 01:29
  • 1
    Note that the concept of a "word count" may be kind of squishy for multilingual input anyway, as not all languages have explicit word separators. (Chinese and Japanese, for instance, have none.) –  Nov 28 '11 at 02:44
  • The question won [more general discussions after 2013's bounty, see below](http://stackoverflow.com/a/19274144/287948). – Peter Krauss Oct 14 '13 at 21:36
  • http://stackoverflow.com/questions/21652261/str-word-count-alternative-for-utf8 – trante Feb 09 '14 at 13:34
  • My PHP 8.1 solution can be seen in other same problem solution; https://stackoverflow.com/a/73352924/6638705 – selcuk mart Aug 14 '22 at 15:42

4 Answers

4

I'd say you guess right. And indeed there are space characters in UTF-8 which are not part of US-ASCII. To give you an example of such spaces:

And perhaps as well:

Anyway, the first one - the 'NO-BREAK SPACE' (U+00A0) - is a good example as it is also part of Latin-X charsets. And the PHP manual already provides a hint that str_word_count would be locale dependent.
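
To make the encoding point concrete, here is a small sketch (assuming PHP 7 or later for the `\u{...}` escape, and the PCRE extension). It shows that U+00A0 occupies two bytes in UTF-8, while the lone byte 0xA0 used by Latin-1 is not valid UTF-8 at all. The variable names are only illustrative.

```php
<?php
// U+00A0 NO-BREAK SPACE written with a Unicode escape (PHP 7+).
$nbspUtf8 = "\u{00A0}";
// The same character as the single byte used by ISO-8859-1 (Latin-1).
$nbspLatin1 = "\xA0";

// In UTF-8 the character is encoded as the two bytes C2 A0.
$utf8Hex = bin2hex($nbspUtf8);

// With the /u modifier PCRE treats the subject as UTF-8.
// \p{Zs} is the Unicode "space separator" category.
$isUnicodeSpace = preg_match('/^\p{Zs}$/u', $nbspUtf8);

// A lone 0xA0 byte is not valid UTF-8, so preg_match() returns false.
$latin1Result = preg_match('/^\p{Zs}$/u', $nbspLatin1);

var_dump($utf8Hex, $isUnicodeSpace, $latin1Result);
```

So a space-like character can exist in UTF-8 without containing the ASCII space byte 0x20 anywhere.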

If we want to put this to the test, we can set the locale to a UTF-8 one and pass in a string containing a lone \xA0 byte, which is invalid as UTF-8. If that byte still counts as a word-breaking character, the function is clearly not UTF-8 safe, and hence not multibyte safe (a term the question itself leaves undefined):

<?php
/**
 * is PHP str_word_count() multibyte safe?
 * @link https://stackoverflow.com/q/8290537/367456
 */

echo 'New Locale: ', setlocale(LC_ALL, 'en_US.utf8'), "\n\n";

$test   = "aword\xA0bword aword";
$result = str_word_count($test, 2);

var_dump($result);

Output:

New Locale: en_US.utf8

array(3) {
  [0]=>
  string(5) "aword"
  [6]=>
  string(5) "bword"
  [12]=>
  string(5) "aword"
}

As this demo shows, the function completely fails to honour the locale dependence promised on its manual page: even under a UTF-8 locale, the lone \xA0 byte is treated as a word break. (I neither wonder nor moan about this; most often, if you read that a function is locale specific in PHP, run for your life and find one that is not.) I exploit that here to demonstrate that the function does nothing at all with regard to the UTF-8 character encoding.
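
The same byte-wise behaviour can be observed with valid UTF-8 input. The following sketch pins the locale to `C` (an assumption, made so that the result does not depend on the machine's settings) and shows str_word_count() cutting a word at a multibyte letter.

```php
<?php
// Pin the locale so the outcome does not depend on the environment.
setlocale(LC_ALL, 'C');

// "um dois três" in UTF-8; the letter ê is the two bytes C3 AA.
$text = "um dois tr\xC3\xAAs";

// Format 1 returns the list of words that were found.
$words = str_word_count($text, 1);

// Neither byte of ê counts as a letter here, so "três" is split in two.
print_r($words);
```

Under these assumptions the result has four entries ("um", "dois", "tr", "s") rather than three.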

Instead for UTF-8 you should take a look into the PCRE extension:

PCRE has a good understanding of Unicode and UTF-8, in PHP specifically. It can also be quite fast if you craft the regular expression pattern carefully.
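
As a minimal sketch of that approach (not a definitive implementation), one can count runs of Unicode letters with the `u` modifier. The function name `utf8_word_count()` and the choice of letters plus combining marks as the definition of a "word" are assumptions made for illustration.

```php
<?php
/**
 * Counts runs of Unicode letters in a UTF-8 string.
 * Hypothetical helper: the name and the word definition are illustrative.
 */
function utf8_word_count(string $text): int
{
    // \p{L} = any letter, \p{M} = combining marks; /u treats the subject as UTF-8.
    $count = preg_match_all('/[\p{L}\p{M}]+/u', $text);

    if ($count === false) {
        // preg_match_all() returns false on failure, e.g. for invalid UTF-8 input.
        throw new InvalidArgumentException('PCRE error code ' . preg_last_error());
    }

    return $count;
}

echo utf8_word_count("um dois tr\xC3\xAAs"), "\n";      // valid UTF-8 with a multibyte letter
echo utf8_word_count("aword\xC2\xA0bword aword"), "\n"; // NO-BREAK SPACE encoded as UTF-8
```

Unlike str_word_count(), this version refuses input that is not valid UTF-8 instead of silently treating stray bytes as separators.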

hakre
  • Thanks for explaining this better and pointing out new problems... But I don't understand the first one: non-letters like `chr(160)` (=dec(A0)) are word separators, and `"\xA0"` is not a UTF-8 symbol; you must use `"\xC2\xA0"` for it to be a valid character in UTF-8. – Peter Krauss Oct 11 '13 at 10:49
  • About [your suggestion](http://stackoverflow.com/questions/4983392) (to use PCRE), [it fails on exactly the first problem you pointed out](http://stackoverflow.com/questions/19316127). – Peter Krauss Oct 11 '13 at 10:55
  • @PeterKrauss: I don't understand your description of what you don't understand, which makes it a little complicated to comment on it :/. And yes, when you operate on UTF-8, you need to have UTF-8, not something else, like the aforementioned ISO-8859 (written as LATIN-X) in the answer. And it does not point out new problems in the context of the question; it only reflects the assumptions expressed in the OP, shows by concrete examples that they are valid, and shows which additional limitations the `str_word_count` function has regarding character encodings and its documentation. – hakre Oct 11 '13 at 19:27
  • @hakre I don't understand why `\xA0` should be considered invalid `utf-8`. Also why shouldn't a non-breaking space be considered a word boundary? – Sébastien Oct 13 '13 at 01:21
  • `\xA0` (a single octet, not followed by anything else) is not a Unicode character encoded as UTF-8. Hence it has to be considered invalid (or just not UTF-8). As for word boundaries, I didn't make any considerations of my own in the answer; I just showed that UTF-8 is not reflected by `str_word_count`: even though the PHP manual page says it's locale dependent, it ignores the UTF-8 encoding of the locale (the input encoding is invalid for that locale). – hakre Oct 13 '13 at 10:08
  • @Sébastien the problem of UTF8 and "acceptable non-UTF8", and error/non-error, was better [discussed here](http://stackoverflow.com/q/19316127/287948); and a new optional [`preg_word_count()` was shown there](http://stackoverflow.com/a/19316871/287948) to compare behaviours. – Peter Krauss Oct 14 '13 at 19:39
  • Thanks @hakre and Peter.Krauss I still need to learn more about that and I love UTF-8! – Sébastien Oct 14 '13 at 20:21
  • Revoked previous 'best answer' and awarded it to this one, because it's more informative. – carpii Oct 28 '13 at 08:00
1

About the "template answer" - I don't get the demand "working faster". We're not talking about long times or lot of counts here, so who cares if it takes some milliseconds longer or not?

However, a str_word_count working with soft hyphen:

function my_word_count($str) {
  return str_word_count(str_replace("\xC2\xAD",'', $str));
}

A function that complies with the asserts (but is probably not faster than str_word_count):

function my_word_count($str) {
  $mystr = str_replace("\xC2\xAD",'', $str);        // soft hyphen encoded in UTF-8
  return preg_match_all('~[\p{L}\'\-]+~u', $mystr); // regex expecting UTF-8
}

The preg function is essentially the same as what's already proposed, except that a) it already returns a count, so there is no need to supply matches, which should make it faster, and b) there really should not be an iconv fallback, IMO.


About a comment:

I can see that your PCRE functions are worse (in performance) than my preg_word_count() because they need a str_replace that is not needed: '~[^\p{L}\'-\xC2\xAD]+~u' works fine (!).

I considered that a different thing: the string replace will only remove the multibyte character, but your regex will deal with \\xC2 and \\xAD in any order they might appear, which is wrong. Consider a registered sign, which is \xC2\xAE.

However, now that I think about it, due to the way valid UTF-8 works, it wouldn't really matter, so that should be usable equally well. So we can just have the function

function my_word_count($str) {
  return preg_match_all('~[\p{L}\'\-\xC2\xAD]+~u', $str); // regex expecting UTF-8
}

without any need for matches or other replacements.
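
One detail worth spelling out: inside a single-quoted PHP string, `\xC2\xAD` reaches PCRE unchanged, and in UTF-8 mode PCRE reads `\xC2` and `\xAD` as the code points U+00C2 and U+00AD, not as the two bytes of one character. The pattern still does the job because U+00C2 is a letter anyway and U+00AD is the soft hyphen. The sketch below writes the soft hyphen as a single code point instead and runs it against two of the assert strings from the bounty text; the name `my_word_count_cp()` is hypothetical, and the source file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical variant of my_word_count(): the soft hyphen (U+00AD) is
// written as one code point instead of as two byte-style escapes.
function my_word_count_cp($str)
{
    // Letters, apostrophe, hyphen-minus and soft hyphen may form a word.
    // Returns the number of matches, or false on failure (e.g. invalid UTF-8).
    return preg_match_all('~[\p{L}\'\-\x{AD}]+~u', $str);
}

// Assert strings taken from the bounty text in this thread.
$hyphens = "(one two,three;four)=4 (five-six se\xC2\xADven)=2";
$accents = "(um±dois três)=3 (àáãâçêéíîóõôúÀÁÃÂÇÊÉÍÎÓÕÔÚ)=1";

var_dump(my_word_count_cp($hyphens)); // hyphen and soft hyphen join word parts
var_dump(my_word_count_cp($accents)); // accented letters are ordinary letters
```

Note that with this character class a free-standing hyphen or apostrophe also counts as a word, which may or may not be what you want.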

About str_word_count(str_replace("\xC2\xAD",'', $str));: if it is stable with UTF-8, good, but it seems it is not.

If you read this thread, you'll know str_replace is safe if you stick to valid UTF-8 strings. I didn't see any evidence in your link of the contrary.

eis
  • Hello, I haven't tried your functions yet (will do later), but I can see that your PCRE functions are worse (in performance) than my `preg_word_count()` because they need a `str_replace` that is not needed: `'~[^\p{L}\'\-\\xC2\\xAD]+~u'` works fine (!). About `str_word_count(str_replace("\xC2\xAD",'', $str));`: if it is stable with UTF-8, good, but [it seems it is not](http://stackoverflow.com/q/3786003/287948). – Peter Krauss Oct 16 '13 at 13:42
  • @PeterKrauss responded in my answer. In short, it does seem that with regex, replacement is not strictly needed if we work with valid UTF-8 strings. However, there will not be any problems with str_replace either. – eis Oct 16 '13 at 15:07
  • OK (!). Oops, one more thing: about the initial question in your answer, *"demand 'working faster'... We're not talking about long times ..."*, OK, I must introduce a criterion: **suppose a UTF-8 text with more than 10000 words**, and that "the faster function is the better!" (no matter whether it is 1.1 times or 10 times faster than my reference functions) when the input has that length. – Peter Krauss Oct 16 '13 at 19:28
  • End of bounty: I voted for both users, hakre and you, so I cannot award the full bounty to only one. Perhaps only hakre wins, because a minimum of 2 votes is needed – Peter Krauss Oct 17 '13 at 15:40
0

EDITED (to show new clues): there is a possible solution using str_word_count() with PHP v5.1!

function my_word_count($str, $myLangChars="àáãâçêéíîóõôúÀÁÃÂÇÊÉÍÎÓÕÔÚ") { 
    return str_word_count($str, 0, $myLangChars);
}

but it is not 100% reliable: I tried to add \xC2\xAD (SHY, the SOFT HYPHEN character), which must be a word component in any language, to $myLangChars, and it does not work (see).
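
A small sketch of what the third argument does, under the assumption of the `C` locale: the character list is applied byte by byte, so each listed multibyte character contributes its individual bytes to the set of allowed word bytes.

```php
<?php
// Assumption: pin the locale so the result does not depend on the environment.
setlocale(LC_ALL, 'C');

// "um dois três" in UTF-8; ê is the two bytes C3 AA.
$text = "um dois tr\xC3\xAAs";

// Without a character list the bytes of ê end the word "tr" early.
$plain = str_word_count($text);

// Listing ê adds the bytes C3 and AA to the allowed word bytes.
$withList = str_word_count($text, 0, "\xC3\xAA");

var_dump($plain, $withList);
```

Because the list works on bytes, it cannot distinguish ê from any other sequence built from the same bytes, which is one reason this approach stays a workaround rather than real UTF-8 support.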

Another, not so fast, but complete and flexible solution (extracted from here), based on PCRE library, but with an option to mimic the str_word_count() behaviour on non-valid-UTF8:

 /**
  * Like str_word_count() but showing how preg can do the same.
  * This function is most flexible but not faster than str_word_count.
  * @param $wRgx the "word regular expression" as defined by user.
  * @param $triggError changes behaviour causing error event.
  * @param $OnBadUtfTryAgain when true mimic the str_word_count behaviour.
  * @return 0 or positive integer as word-count, negative as PCRE error.
  */
 function preg_word_count($s,$wRgx='/[-\'\p{L}\xC2\xAD]+/u', $triggError=true,
                          $OnBadUtfTryAgain=true) {
   if ( preg_match_all($wRgx,$s,$m) !== false )
      return count($m[0]);
   else {
      $lastError = preg_last_error();
      $chkUtf8 = ($lastError==PREG_BAD_UTF8_ERROR);
      if ($OnBadUtfTryAgain && $chkUtf8) 
         return preg_word_count(
            iconv('CP1252','UTF-8',$s), $wRgx, $triggError, false
         );
      elseif ($triggError) trigger_error(
         $chkUtf8? 'non-UTF8 input!': "error PCRE_code-$lastError",
         E_USER_NOTICE
         );
      return -$lastError;
   }
 }

(TEMPLATE ANSWER) help for bounty!

(This is not an answer; it is a help text for the bounty, because I can neither edit nor duplicate the question.)

We want to count "real-world words" in a UTF-8 Latin text.

FOR BOUNTY, WE NEED:

  • a function that complies with the asserts below and is faster than str_word_count;
  • or str_word_count working with the SHY character (how to?);
  • or preg_word_count working faster (using preg_replace? a word-separator regular expression?).

ASSERTS

Suppose that a "multibyte safe" function my_word_count() exists; then the following asserts must be true:

assert_options(ASSERT_ACTIVE, 1);

$text = "1,2,3,4=0 (1 2 3 4)=0 (... ,.)=0  (2.5±0.1; 0.5±0.2)=0";
assert( my_word_count($text)==0 ); // no word there 

$text = "(one two,three;four)=4 (five-six se\xC2\xADven)=2";
assert( my_word_count($text)==6 ); // hyphen merges two words 

$text = "(um±dois três)=3 (àáãâçêéíîóõôúÀÁÃÂÇÊÉÍÎÓÕÔÚ)=1";
assert( my_word_count($text)==4 ); // a UTF8 case 

$text = "(ÍSÔ9000-X, ISÔ 9000-X, ÍSÔ-9000-X)=6"; //Codes are words?
assert( my_word_count($text)==6 ); // suppose no: X is another word
Peter Krauss
  • **End of bounty**: I voted for both users, @hakre and eis, so I cannot award the full bounty to only one. Perhaps only hakre wins, because a minimum of 2 votes is needed. – Peter Krauss Oct 17 '13 at 15:38
-2

All it does is count the number of spaces, or the words in between. If you're curious, you can just make your own counting function using explode and count.

Any time the ASCII space byte is found, it splits, and that's all there really is to it.
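
A quick way to check that claim, assuming the `C` locale, is to compare a plain split on spaces with the built-in function. This is only a sketch with illustrative variable names; digits, for example, are not treated as word characters by str_word_count().

```php
<?php
// Assumption: pin the locale so the result does not depend on the environment.
setlocale(LC_ALL, 'C');

$text = "one two3three";

// Splitting on the ASCII space byte only.
$bySpaces = count(explode(' ', $text));

// The built-in function also breaks words at the digit.
$byFunction = str_word_count($text);

var_dump($bySpaces, $byFunction);
```

The two counts differ for this input (2 versus 3), so the function does more than split on spaces.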

Adam Fowler
  • 5
    Not to be difficult, but that is not strictly true. For example, it will split words on numbers and other UTF8 special characters (such as the apostrophe from MS Word). As noted by @boltclock, it is not by its nature UTF8 friendly. http://php.net/manual/en/function.str-word-count.php – Tony Sep 02 '13 at 19:54