32

The iconv function sometimes gives me an error:

Notice: iconv() [function.iconv]: Detected an incomplete multibyte character in input string in [...]

Is there a way to detect illegal characters in a UTF-8 string before sending the data to iconv()?

Peter Mortensen
rsk82
  • Meanwhile I found this: http://stackoverflow.com/questions/4407854/how-to-detect-if-have-to-apply-utf8-decode-or-encode-on-a-string – rsk82 Jul 17 '11 at 11:50

5 Answers

74

First, note that it is not possible to detect whether text belongs to a specific undesired encoding. You can only check whether a string is valid in a given encoding.

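You can use the UTF-8 validity check built into preg_match [PHP Manual], available since PHP 4.3.5. With the u modifier, the match fails if the subject is not valid UTF-8: depending on the PHP version it returns 0 (with no additional information) or false (with preg_last_error() set to PREG_BAD_UTF8_ERROR). Either way the result is falsy: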

$isUTF8 = preg_match('//u', $string);
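
A minimal sketch of how this check could be wrapped for reuse. The helper name `is_valid_utf8` is an assumption made for illustration, not a PHP built-in; comparing against 1 treats both 0 and false as "invalid".

```php
<?php
// Hypothetical helper; the name is an assumption, not part of PHP.
function is_valid_utf8(string $string): bool
{
    // With the /u modifier PCRE validates the subject before matching.
    // The empty pattern matches any valid subject (including ''), so the
    // result is 1 for valid input and 0 or false for invalid input.
    return preg_match('//u', $string) === 1;
}

var_dump(is_valid_utf8("caf\xC3\xA9")); // bool(true): "café" encoded as UTF-8
var_dump(is_valid_utf8("\xB1\x31"));    // bool(false): stray continuation byte
var_dump(is_valid_utf8(''));            // bool(true): the empty string is valid

// The reason for a failed match can be inspected afterwards.
preg_match('//u', "\xB1\x31");
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR);
```

Writing the test strings with byte escapes keeps the example independent of the encoding the script file itself is saved in.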

Another possibility is mb_check_encoding [PHP Manual]:

$validUTF8 = mb_check_encoding($string, 'UTF-8');

Another function you can use is mb_detect_encoding [PHP Manual]:

$validUTF8 = ! (false === mb_detect_encoding($string, 'UTF-8', true));

It's important to set the strict parameter to true.
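
To compare the two mbstring checks side by side, something like the following could be used. It is only a sketch: it assumes the mbstring extension is loaded, and the function name `utf8_checks` and the sample byte strings are made up for illustration.

```php
<?php
// Hypothetical helper: runs both mbstring checks on the same bytes.
// Requires the mbstring extension.
function utf8_checks(string $bytes): array
{
    return [
        // Direct validity check.
        'check'  => mb_check_encoding($bytes, 'UTF-8'),
        // Strict detection against a single candidate: false means "not valid UTF-8".
        'detect' => mb_detect_encoding($bytes, 'UTF-8', true) !== false,
    ];
}

$samples = [
    'ascii'     => 'plain',
    'valid'     => "caf\xC3\xA9", // "café" encoded as UTF-8
    'truncated' => "caf\xC3",     // lead byte whose continuation byte is missing
    'latin1'    => "caf\xE9",     // "café" encoded as ISO-8859-1
];

foreach ($samples as $label => $bytes) {
    $result = utf8_checks($bytes);
    printf(
        "%-9s check=%s detect=%s\n",
        $label,
        var_export($result['check'], true),
        var_export($result['detect'], true)
    );
}
```

The truncated sample is the case where the strict parameter matters most, since it looks like the start of a valid sequence.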

Additionally, iconv [PHP Manual] lets you transliterate or drop invalid sequences on the fly. (However, when iconv encounters such a sequence it raises a notice; this behavior cannot be changed.)

echo 'TRANSLIT : ', iconv("UTF-8", "ISO-8859-1//TRANSLIT", $string), PHP_EOL;
echo 'IGNORE   : ', iconv("UTF-8", "ISO-8859-1//IGNORE", $string), PHP_EOL;

You can use @ and check the length of the return string:

strlen($string) === strlen(@iconv('UTF-8', 'UTF-8//IGNORE', $string));

Check the examples on the iconv manual page as well.
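
The length comparison above could be wrapped as follows. This is a sketch under assumptions: the function name `iconv_accepts_utf8` is hypothetical, and the extra check for false is there because iconv() is documented to return false on failure, which some implementations may do instead of dropping the offending bytes.

```php
<?php
// Hypothetical helper; requires the iconv extension.
function iconv_accepts_utf8(string $string): bool
{
    // @ suppresses the notice that iconv raises for invalid sequences.
    $cleaned = @iconv('UTF-8', 'UTF-8//IGNORE', $string);

    // false: the conversion failed outright.
    // Shorter result: invalid bytes were dropped.
    // Either way, the input was not clean UTF-8.
    return $cleaned !== false && strlen($cleaned) === strlen($string);
}

var_dump(iconv_accepts_utf8("caf\xC3\xA9")); // valid UTF-8 input
var_dump(iconv_accepts_utf8("\xB1\x31"));    // input with a stray continuation byte
```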

Peter Mortensen
hakre
  • you're not wrong but seems that `preg_match('!.!u', $str)` does the trick - it silently check if str is utf-8 before attempting to find anything. - that dot in regexp is not even needed – rsk82 Jul 17 '11 at 12:30
  • @user393087: I've made a slight edit to make the `preg_match` method work correctly on empty strings as well. – hakre Jul 17 '11 at 13:34
  • @hakre: Thanks for the nice content. – Alan Mar 28 '15 at 20:15
  • Good overview of all the options! I wrote a [micro-benchmark](https://github.com/mindplay-dk/benchpress/blob/master/example4.php) to see which is faster - `preg_match()` appears to be the fastest overall (under PHP 7) both for valid/invalid and short/long strings. – mindplay.dk Jul 20 '16 at 19:04
  • Great answer! It's worth pointing out that you can also check the scripts the unicode characters belong to, so this string "Àbcdeឃ" will fail to match /^\p{Latin}*$/u but will match /^[\p{Latin}\p{Khmer}]*$/u. The [php manual](http://www.php.net/manual/en/regexp.reference.unicode.php) lists the various supported scripts. – wordragon Sep 10 '18 at 03:45
  • When I run the preg method I am getting `false` with `preg_last_error() === PREG_BAD_UTF8_ERROR`. The documentation isn't really clear about this but any chance the behaviour has been changed recently? – Bananaapple Jun 25 '19 at 08:44
  • @Bananaapple: preg_... functions in PHP make use of the PCRE library, and in case of this error, the PHP constant reflects the error code from the PCRE library (PCRE_ERROR_BADUTF8). [PHP 7.3 got a migration from PCRE to PCRE2](https://wiki.php.net/rfc/pcre2-migration) under the hood. Regardless if that is the version of PHP you're using, you can check the PCRE library version in use by checking the PCRE_VERSION constant and/or phpinfo(). This might reveal changes. For technical details check https://www.pcre.org/current/doc/html/pcre2unicode.html for "Errors in UTF-8 strings ". – hakre Jun 25 '19 at 18:16
1

If you are using json_encode, you can check json_last_error afterwards:

<?php
// An invalid UTF8 sequence
$text = "\xB1\x31";

$json  = json_encode($text);
$error = json_last_error();

var_dump($json, $error === JSON_ERROR_UTF8);

Output (e.g. for PHP versions 5.3.3 - 5.3.13, 5.3.15 - 5.3.29, 5.4.0 - 5.4.45):

string(4) "null"
bool(true)
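
Since the encoded value differs between versions (from PHP 5.5 on, json_encode returns false on failure rather than the string "null"), a check that relies only on the error code may be more portable. A minimal sketch, where the function name `json_reports_bad_utf8` is an assumption:

```php
<?php
// Hypothetical helper; requires the json extension.
function json_reports_bad_utf8(string $text): bool
{
    // The encoded result is discarded; only the error state matters here.
    json_encode($text);

    return json_last_error() === JSON_ERROR_UTF8;
}

var_dump(json_reports_bad_utf8("\xB1\x31"));    // the invalid sequence from above
var_dump(json_reports_bad_utf8("caf\xC3\xA9")); // valid UTF-8
```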
hakre
xuhuaiqu
  • I'm a bit curious, on which distro / php compile did you run this back in november last year? -- not a trick question. – hakre Feb 11 '23 at 22:29
0

The specification of which characters are invalid is pretty clear. You probably want to strip those out before trying to parse the input. They shouldn't be there at all, so if you can avoid them even before generating the XML, that would be better still.

See here for a reference:

http://www.w3.org/TR/xml/#charsets

That isn't a complete list. Many parsers also disallow some low-numbered control characters, but I can't find a comprehensive list right now.

However, iconv might have built-in support for this:

http://www.zeitoun.net/articles/clear-invalid-utf8/start

Peter Mortensen
jishi
0

You could try using mb_detect_encoding to detect whether you've got a character set other than UTF-8, then mb_convert_encoding to convert to UTF-8 if required. It's more likely that people are giving you valid content in a different character set than giving you invalid UTF-8.
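
One way this idea could look in code is sketched below. Note that it swaps the detection step for a plain validity check: input that is already valid UTF-8 is passed through, and anything else is assumed to be in a single fallback encoding. Both the function name `ensure_utf8` and the ISO-8859-1 default are assumptions; the right fallback depends on where the data comes from.

```php
<?php
// Hypothetical helper; requires the mbstring extension.
// Assumption: any input that is not valid UTF-8 is encoded in $fallback.
function ensure_utf8(string $input, string $fallback = 'ISO-8859-1'): string
{
    if (mb_check_encoding($input, 'UTF-8')) {
        // Already valid UTF-8 (this includes plain ASCII): leave untouched.
        return $input;
    }

    // Reinterpret the bytes as the fallback encoding and convert.
    return mb_convert_encoding($input, 'UTF-8', $fallback);
}

var_dump(ensure_utf8("caf\xE9") === "caf\xC3\xA9");     // Latin-1 input is converted
var_dump(ensure_utf8("caf\xC3\xA9") === "caf\xC3\xA9"); // UTF-8 input is unchanged
```

Checking validity first avoids double-encoding text that was fine to begin with.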

Robin
  • Be aware that valid ASCII strings are also valid UTF8 strings. This means mb_detect_encoding will return "ASCII" for any string that's a valid UTF8 string and which doesn't contain any Unicode characters – GordonM Jan 16 '17 at 15:21
0

Put an @ in front of iconv() to suppress the notice, and append //IGNORE to the destination encoding to drop invalid characters:

@iconv('UTF-8', $destinationEncoding . '//IGNORE', $yourString);
Peter Mortensen
nobody
  • I know how to ignore it, I don't know how to detect it, I don't want to pass it silently down my code. – rsk82 Jul 17 '11 at 11:54
  • btw the `preg_match()` solution in the other question is very interesting I'd go with that. – nobody Jul 17 '11 at 12:04