The NO-BREAK SPACE and many other characters need 2 bytes in UTF-8; so, in a context where strings are supposed to be UTF-8, an isolated non-ASCII byte (>127, e.g. \xA0 not preceded by \xC2) is not a recognized character... OK, it is only a layout problem (!), but does it corrupt the whole string?

How can I avoid this unexpected behaviour? (It occurs in some functions and not in others.)

Example (the unexpected behaviour occurs with preg_match_all only):

  header("Content-Type: text/plain; charset=utf-8"); // same if text/html
  //PHP Version 5.5.4-1+debphp.org~precise+1
  //using a .php file encoded as UTF-8.

  $s = "THE UTF-8 NO-BREAK\xA0SPACE"; // a non-ASCII byte
  preg_match_all('/[-\'\p{L}]+/u',$s,$m);
  var_dump($m);            // empty! (corrupted)
  $m=str_word_count($s,1);
  var_dump($m);            // ok

  $s = "THE UTF-8 NO-BREAK\xC2\xA0SPACE";  // utf8-encoded nbsp
  preg_match_all('/[-\'\p{L}]+/u',$s,$m);
  var_dump($m);            // ok!
  $m=str_word_count($s,1);
  var_dump($m);            // ok
Peter Krauss
  • This could be of interest http://php.net/manual/en/reference.pcre.pattern.modifiers.php#54805 – Pebbl Oct 11 '13 at 11:05
  • To summarize the question: Why do some functions fail entirely on *invalidly encoded strings* and/or how to avoid that? As for *why*: because `preg_*` delegates to the PCRE regex C library, while other functions like `str_word_count` are based on other libraries and the authors of these different libraries had different opinions/requirements for error handling. – deceze Oct 11 '13 at 11:24
  • See also a [related question](http://stackoverflow.com/a/19274144/287948). See [my answer-of-comments consolidated below](http://stackoverflow.com/a/19316871/287948). – Peter Krauss Oct 14 '13 at 21:32

2 Answers

This is not a complete answer, because I do not say why some PHP functions "fail entirely on invalidly encoded strings" and others do not: see @deceze in the question's comments and @hakre's answer. If you are looking for a PCRE replacement for str_word_count(), see my preg_word_count() below.

PS: about the "uniformity of behaviour of PHP5's built-in libraries" discussion, my conclusion is that PHP5 is not so bad, but we have to create a lot of user-defined wrapper (façade) functions (see the diversity of PHP frameworks!)... Or wait for PHP6 :-)


Thanks @pebbl! If I understand your link, PHP lacks error messages here. So a possible workaround for the problem I illustrated is to add an error condition... I found the condition here (it ensures valid UTF-8!)... And thanks @deceze for reminding me that a built-in function exists to check this condition (I edited the code afterwards).
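The built-in check mentioned above can be sketched as follows. This is a minimal illustration, assuming the mbstring extension is loaded; the helper name `is_valid_utf8` is hypothetical and does not come from the original post.

```php
<?php
// Hypothetical helper (illustrative name): reports whether a byte string
// is well-formed UTF-8. Assumes the mbstring extension is available.
function is_valid_utf8($s) {
    return mb_check_encoding($s, 'UTF-8');
}

// A lone 0xA0 byte is a continuation byte without a lead byte.
var_dump(is_valid_utf8("THE UTF-8 NO-BREAK\xA0SPACE"));     // bool(false)
// 0xC2 0xA0 is the UTF-8 encoding of U+00A0 (NO-BREAK SPACE).
var_dump(is_valid_utf8("THE UTF-8 NO-BREAK\xC2\xA0SPACE")); // bool(true)
```

Such a check lets the caller reject or convert bad input before it reaches a pattern that uses the `/u` modifier.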

Putting the issues together, here is a solution expressed as a function (EDITED, thanks to @hakre's comments!):

 function my_word_count($s,$triggError=true) {
   if ( preg_match_all('/[-\'\p{L}]+/u',$s,$m) !== false )
      return count($m[0]);
   else {
      if ($triggError) trigger_error(
         // no need for mb_check_encoding($s,'UTF-8'), see hakre's answer;
         // so I was wrong, there is no 'mysterious error' with preg functions
         (preg_last_error()==PREG_BAD_UTF8_ERROR)? 
              'non-UTF8 input!': 'other error',
         E_USER_NOTICE
         );
      return NULL;
   }
 }

Now (edited after thinking about @hakre's answer), about uniform behaviour: we can develop a reasonable function with the PCRE library that mimics the str_word_count behaviour, accepting bad UTF-8. For this task I used @bobince's iconv tip:

 /**
  * Like str_word_count(), but showing how preg can do the same.
  * This function is more flexible than str_word_count, but not faster.
  * @param $wRgx the "word regular expression" as defined by the user.
  * @param $triggError if true, triggers an E_USER_NOTICE on error.
  * @param $OnBadUtfTryAgain if true, retries on bad UTF-8 treating the
  *        input as CP1252 (mimics the str_word_count behaviour).
  * @return 0 or positive integer as word count, negative as PCRE error.
  */
 function preg_word_count($s,$wRgx='/[-\'\p{L}]+/u', $triggError=true,
                          $OnBadUtfTryAgain=true) {
   if ( preg_match_all($wRgx,$s,$m) !== false )
      return count($m[0]);
   else {
      $lastError = preg_last_error();
      $chkUtf8 = ($lastError==PREG_BAD_UTF8_ERROR);
      if ($OnBadUtfTryAgain && $chkUtf8) 
         return preg_word_count(
            iconv('CP1252','UTF-8',$s), $wRgx, $triggError, false
         );
      elseif ($triggError) trigger_error(
         $chkUtf8? 'non-UTF8 input!': "error PCRE_code-$lastError",
         E_USER_NOTICE
         );
      return -$lastError;
   }
 }

Demonstrating (try other inputs!):

 $s = "THE UTF-8 NO-BREAK\xA0SPACE"; // a non-ASCII byte
 print "\n-- str_word_count=".str_word_count($s,0);
 print "\n-- preg_word_count=".preg_word_count($s);

 $s = "THE UTF-8 NO-BREAK\xC2\xA0SPACE";  // utf8-encoded nbsp
 print "\n-- str_word_count=".str_word_count($s,0);
 print "\n-- preg_word_count=".preg_word_count($s);
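A possible variation, sketched here under assumptions: instead of retrying with a guessed encoding, the caller declares the input encoding and the function converts once. The function name `word_count_from` and its parameter order are hypothetical, and `mb_convert_encoding` (mbstring extension) is swapped in for `iconv`.

```php
<?php
// Hypothetical variant of preg_word_count(): the caller states the input
// encoding; nothing is guessed inside the function. Assumes mbstring.
function word_count_from($s, $fromEncoding, $wRgx = '/[-\'\p{L}]+/u') {
    // Convert to UTF-8 only when the declared encoding is something else.
    if (strcasecmp($fromEncoding, 'UTF-8') !== 0) {
        $s = mb_convert_encoding($s, 'UTF-8', $fromEncoding);
    }
    $n = preg_match_all($wRgx, $s, $m);
    if ($n === false) {
        // Typically PREG_BAD_UTF8_ERROR when the declared encoding is wrong.
        throw new InvalidArgumentException(
            'preg_match_all failed (PCRE error code ' . preg_last_error()
            . "); is the input really $fromEncoding?"
        );
    }
    return $n;
}

// 0xA0 is NO-BREAK SPACE in ISO-8859-1; it becomes 0xC2 0xA0 in UTF-8.
print word_count_from("THE UTF-8 NO-BREAK\xA0SPACE", 'ISO-8859-1') . "\n";  // 4
print word_count_from("THE UTF-8 NO-BREAK\xC2\xA0SPACE", 'UTF-8') . "\n";   // 4
```

Both calls count THE, UTF-, NO-BREAK and SPACE. A lone `\xA0` declared as UTF-8 raises the exception instead of being silently reinterpreted.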
Peter Krauss
  • Or just use `mb_check_encoding`. – deceze Oct 11 '13 at 11:27
  • preg_match_all *does* check for UTF-8 already (and is more correct than mb_* as well), just deal with the return value properly and you're fine already. My answer has that a bit "extended", please don't feel offended, I hope it can show some general concepts what to look for to not run into these kind of problems too often. – hakre Oct 11 '13 at 20:02
  • Thanks for the optimized function suggestions; I edited, please check if it is OK now, using the return information. My focus is on two points, now highlighted by the function: 1) the non-uniformity of behaviour: `str_word_count` does something and is good, `preg_match_all` does nothing, losing the process. 2) the lack of good error messages with `preg_match_all`. – Peter Krauss Oct 12 '13 at 12:09
  • Oops, sorry, now I see the use of `preg_last_error()` in your answer, so I was wrong about item 2 above (!), and I edited more. – Peter Krauss Oct 12 '13 at 12:19
  • Well, hmm. I can see you are trying to solve something here, but if you criticize inconsistencies, I would be very careful (if not even conservative) about putting so much logic and differentiation into a single function. This is probably a real issue with PHP inconsistencies: that some developers try to wish them away so badly that they create such fragile / magic code. You should already know the encoding of the input string and have that defined. *Or* pass it as a parameter. But don't guess, at least not *inside* that function. – hakre Oct 13 '13 at 14:49
  • Let me know if that comment already makes sense to you or not and if not which part. I then try to elaborate that so it's more clear. – hakre Oct 13 '13 at 14:50
  • I see 2 functions up there. Which one is better to use for ultimate flexibility? – CMCDragonkai Nov 08 '13 at 19:36
  • Hello @CMCDragonkai, the first function is illustrative; the second, `preg_word_count()`, is a complete solution. About performance and flexibility, compare with [the built-in function](http://php.net/manual/fr/function.str-word-count.php) and *say here* which is better for your specific problem. – Peter Krauss Nov 09 '13 at 09:22
  • I get this when I run the second function `Notice: iconv(): Wrong charset, conversion from `CP1252' to `UTF-8' is not allowed in /code/jugUer on line 12` – CMCDragonkai Nov 09 '13 at 12:09
  • @CMCDragonkai, perhaps we need to use the chat... See [this other answer](http://stackoverflow.com/a/6723593/287948), there are many ways to use and configure `iconv`, and perhaps you need to use [mb_detect_encoding](http://php.net/manual/en/function.mb-detect-encoding.php) to replace 'CP1252' (see also `mb_convert_encoding`)... Or convert to 'UTF8/IGNORE', etc. Are you inputting UTF16 or Chinese or another non-Latin language? – Peter Krauss Nov 10 '13 at 12:16
  • I just copy pasted this `"THE UTF-8 NO-BREAK\xA0SPACE"`. – CMCDragonkai Nov 10 '13 at 17:07
  • Your problem is with the PHP configuration ([locale etc.](http://php.net/manual/en/book.iconv.php)); for me it is OK: `print iconv('CP1252','UTF-8',"THE UTF-8 NO-BREAK\xA0SPACE");` works with no error. – Peter Krauss Nov 10 '13 at 19:25

Okay, I can somewhat feel your disappointment that things didn't work out easily when switching from str_word_count to preg_match_all. However, the way you ask the question is a bit imprecise; I'll try to answer it anyway. Imprecise, because you make a number of wrong assumptions that you obviously take for granted (it happens to the best of us). I hope I can correct this a little:

  $s = "THE UTF-8 NO-BREAK\xA0SPACE"; // a non-ASCII byte
  preg_match_all('/[-\'\p{L}]+/u',$s,$m);
  var_dump($m);            // empty! (corrupted)

This code is wrong. You blame PHP here for not giving a warning or something, but I must admit, the only one to blame here is "you". PHP does allow you to check for the error. Before you judge so early that error handling must mean giving a warning, I have to remind you that there are different ways to deal with errors. One way is to emit messages; another is to signal errors through return values. And if we visit the manual page of preg_match_all and look at the documentation of the return value, we find this:

Returns the number of full pattern matches (which might be zero), or FALSE if an error occurred.

The part at the end:

FALSE if an error occurred [Highlight by me]

is a common way in error handling to signal to the calling code that an error occurred. Let's review the code that you think does not work:

  $s = "THE UTF-8 NO-BREAK\xA0SPACE"; // a non-ASCII byte
  preg_match_all('/[-\'\p{L}]+/u',$s,$m);
  var_dump($m);            // empty! (corrupted)

The only thing this code shows is that the person who typed it (I guess it was you) clearly decided not to do any error handling. That's fine, unless that person also protests that the code won't work.

The sad thing about this is that it is a common user error: if you write fragile code (e.g. without error handling), don't expect it to work in a solid manner. That will never happen.

So what does this require when you program? First of all, you should know the functions you use. That normally requires knowledge of the input parameters and the return values. You normally find that information documented. Use the manual. Second, you actually need to care about return values and do the error handling on your own. The function alone does not know what it means if an error occurred. Is it an exception? Then you probably need to do the exception handling, as in the demo example:

  <?php
  /**
   * @link http://stackoverflow.com/q/19316127/367456
   */

  $s = "THE UTF-8 NO-BREAK\xA0SPACE"; // a non-ASCII byte
  $result = preg_match_all('/[-\'\p{L}]+/u',$s,$m);

  if ($result === FALSE) {
      switch (preg_last_error()) {
          case PREG_BAD_UTF8_ERROR:
              throw new InvalidArgumentException(
                  'UTF-8 encoded binary string expected.'
              );
          default:
              throw new RuntimeException('preg error occurred.');
      }
  }

  var_dump($m);            // nothing at all corrupted...

In any case, it means you need to look at what you do, learn about it and write more code. No magic. No bug. Just a bit of work.
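As a sketch of what "a bit of work" might look like in reusable form, the demo above can be folded into a wrapper so that callers handle the failure with try/catch. The name `preg_match_all_strict` and the messages are hypothetical, not part of PHP or of the original answer.

```php
<?php
// Hypothetical wrapper (illustrative name and messages): converts the
// FALSE return value of preg_match_all() into exceptions.
function preg_match_all_strict($pattern, $subject) {
    $result = preg_match_all($pattern, $subject, $matches);
    if ($result === false) {
        $code = preg_last_error();
        if ($code === PREG_BAD_UTF8_ERROR) {
            throw new InvalidArgumentException('UTF-8 encoded string expected.');
        }
        throw new RuntimeException("preg error occurred (code $code).");
    }
    return $matches;
}

try {
    $m = preg_match_all_strict('/[-\'\p{L}]+/u', "THE UTF-8 NO-BREAK\xA0SPACE");
    var_dump($m);
} catch (InvalidArgumentException $e) {
    // The caller decides what bad input means; here it is only reported.
    print 'Rejected input: ' . $e->getMessage() . "\n";
}
```

With the invalid byte in the subject, the catch branch runs; the function itself stays free of any decision about what the error means for the application.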

The other part you have in front of you is perhaps to understand what characters in software are, but that is largely independent of concrete programming languages like PHP. For example, you can take an introductory read here:

The first is a must-read, or perhaps a must-bookmark, because it is a lot to read but it explains it all very well.

hakre
  • Thanks, you are right about the use of the return code... I took the wrong way and lost *the focus* when I showed a function as the solution: there is a *lack of uniformity in the behavior of PHP*. My "manifestation" was about the lack of uniformity and of good error messages; sorry, I was not polite to the PHP language, and I can edit the question/answer texts... But your "manifestation" is equally exaggerated, and therefore not very polite to me. Check the other interpretation of my message: – Peter Krauss Oct 12 '13 at 11:47
  • the `preg_match` behaviour says that the "whole string is corrupted", but the `str_word_count` behaviour does not say the same; there is a semantic conflict. Another problem is an "error return" that does not say what kind of error occurred. – Peter Krauss Oct 12 '13 at 11:48
  • @PeterKrauss: Error handling has many faces; you need to know how each individual function and/or subsystem handles it. The semantic conflict you see is a wrong concept, because these subsystems are unrelated. E.g. all the preg stuff is a binding to PCRE, which is a totally different library than the locale-based C string functions that make up a large part of PHP's string library. In PHP you cannot expect that everything is polished and works the same, or has been put through a layer that is semantically cleaned up. You normally do that in the code you write yourself, to the degree you need it. – hakre Oct 13 '13 at 10:14
  • @PeterKrauss: I have to apologize as well. I could feel your disappointment, but what I wanted to say with the answer is that you should not be too disappointed, because a large part of PHP is that way. It's normally solved with documentation and helping each other. Just don't confuse the inconsistencies. Some of these have clear reasons and it's better to understand the reasons. Some of these don't have any reason, and for those it's important to know that there is no reason :) - You will live better with that attitude as a PHP developer :) – hakre Oct 13 '13 at 10:19
  • Thanks (!), we are back to the focus of the discussion, and now I see that I agree with your positions and explanations. I also completed my answer, commenting on your tips and adding a `preg_word_count()` function to illustrate a "programmer's effort to mimic uniformity" (to overcome the difficulties of PHP5)... Lastly, our discussion shows another face of the solution: we need the community! A PHP programmer alone (like me when I posted the question) is a bad programmer. – Peter Krauss Oct 13 '13 at 13:31