Okay, I can somewhat feel your disappointment that things didn't worked easily out switching from str_word_count
to preg_match_all
. However the way you ask the question is a bit imprecise, I try to answer it anyway. Imprecise, because you have a high amount of wrong assumptions that you obviously take for granted (it happens to the best of us). I hope I can correct this a little:
$s = "THE UTF-8 NO-BREAK\xA0SPACE"; // a non-ASCII byte
preg_match_all('/[-\'\p{L}]+/u',$s,$m);
var_dump($m); // empty! (corrupted)
This code is wrong. You blame PHP here for not giving a warning or something, but I must admit, the only one to blame here is "you". PHP does allow you to check for the error. Before you judge so early that a warning has to be given in error handling, I have to remind you that there are different ways how to deal with errors. Some dealing is with giving messages, another type of dealing with errors is by telling about them with return values. And if we visit the manual page of preg_match_all
and look for the documentation of the return value, we can find this:
Returns the number of full pattern matches (which might be zero), or FALSE if an error occurred.
The part at the end:
FALSE if an error occurred [Highlight by me]
is some common way in error handling to signal the calling code that some error occured. Let's review your code of which you think it does not work:
$s = "THE UTF-8 NO-BREAK\xA0SPACE"; // a non-ASCII byte
preg_match_all('/[-\'\p{L}]+/u',$s,$m);
var_dump($m); // empty! (corrupted)
The only thing this code shows is that the person who typed it (I guess it was you), clearly decided to not do any error handling. That's fine unless that person as well protests that the code won't work.
The sad thing about this is, that this is a common user-error, if you write fragile code (e.g. without error handling), don't expect it to work in a solid manner. That will never happen.
So what does this require when you program? First of all you should know about the functions you use. That normally requires knowledge about the input parameters and the return values. You find that information normally documented. Use the manual. Second you actually need to care about return values and do the error handling your own. The function alone does not know what it means if an error occured. Is it an exception? Then you need to do the exception handling probably as in the demo example:
<?php
/**
* @link http://stackoverflow.com/q/19316127/367456
*/
$s = "THE UTF-8 NO-BREAK\xA0SPACE"; // a non-ASCII byte
$result = preg_match_all('/[-\'\p{L}]+/u',$s,$m);
if ($result === FALSE) {
switch (preg_last_error()) {
case PREG_BAD_UTF8_ERROR:
throw new InvalidArgumentException(
'UTF-8 encoded binary string expected.'
);
default:
throw new RuntimeException('preg error occured.');
}
}
var_dump($m); // nothing at all corrupted...
In any case it means you need to look what you do, learn about it and write more code. No magic. No bug. Just a bit of work.
The other part you've in front of you is perhaps to understand what characters in a software are, but that is more independent to concrete programming languages like PHP, for example you can take an introductory read here:
The first is a must read or perhaps must-bookmark, because it is a lot to read but it explains it all very good.