Regex to detect invalid UTF-8 string

Question

In PHP, we can use mb_check_encoding() to determine if a string is valid UTF-8. But that's not a portable solution as it requires the mbstring extension to be compiled in and enabled. Additionally, it won't tell us which character is invalid.

Is there a regular expression (or any other 100% portable method) that can match invalid UTF-8 bytes in a given string?

That way, those bytes can be replaced if needed while keeping the binary information (for example, when building a test output XML file that includes binary data). Converting the string to UTF-8 would lose information, so we may want to convert:

"foo" . chr(128) . chr(255)

Into

"foo<128><255>"

So just detecting that the string is invalid is not good enough; we need to be able to detect which bytes are invalid.

score 41 · Accepted Answer · edited Mar 16 '23 at 07:57

You can use this PCRE regular expression to check for byte sequences in a string that are not valid UTF-8. If the regex matches, the string contains invalid byte sequences. It's 100% portable because it doesn't rely on PCRE_UTF8 to be compiled in.

$regex = '/(
    [\xC0-\xC1] # Invalid UTF-8 Bytes
    | [\xF5-\xFF] # Invalid UTF-8 Bytes
    | \xE0[\x80-\x9F] # Overlong encoding of prior code point
    | \xF0[\x80-\x8F] # Overlong encoding of prior code point
    | [\xC2-\xDF](?![\x80-\xBF]) # Invalid UTF-8 Sequence Start
    | [\xE0-\xEF](?![\x80-\xBF]{2}) # Invalid UTF-8 Sequence Start
    | [\xF0-\xF4](?![\x80-\xBF]{3}) # Invalid UTF-8 Sequence Start
    | (?<=[\x00-\x7F\xF5-\xFF])[\x80-\xBF] # Invalid UTF-8 Sequence Middle
    | (?<![\xC2-\xDF]|[\xE0-\xEF]|[\xE0-\xEF][\x80-\xBF]|[\xF0-\xF4]|[\xF0-\xF4][\x80-\xBF]|[\xF0-\xF4][\x80-\xBF]{2})[\x80-\xBF] # Overlong Sequence
    | (?<=[\xE0-\xEF])[\x80-\xBF](?![\x80-\xBF]) # Short 3 byte sequence
    | (?<=[\xF0-\xF4])[\x80-\xBF](?![\x80-\xBF]{2}) # Short 4 byte sequence
    | (?<=[\xF0-\xF4][\x80-\xBF])[\x80-\xBF](?![\x80-\xBF]) # Short 4 byte sequence (2)
)/x';

We can test it by creating a few variations of text:

// Overlong encoding of code point 0
$text = chr(0xC0) . chr(0x80);
var_dump(preg_match($regex, $text)); // int(1)
// Overlong 5-byte encoding (5-byte sequences are not valid UTF-8)
$text = chr(0xF8) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80);
var_dump(preg_match($regex, $text)); // int(1)
// Overlong 6-byte encoding (6-byte sequences are not valid UTF-8)
$text = chr(0xFC) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80);
var_dump(preg_match($regex, $text)); // int(1)
// 2-byte lead byte without a continuation byte
$text = chr(0xD0) . chr(0x01);
var_dump(preg_match($regex, $text)); // int(1)

etc...

In fact, since this matches the invalid bytes themselves, you can also use it with preg_replace to strip them:

preg_replace($regex, '', $text); // Remove the invalid bytes that the regex matches
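
To get the output format asked for in the question, the matched bytes can be rewritten instead of dropped. The following is a minimal sketch and not part of the original answer: the helper name escape_invalid_utf8() is hypothetical, and the pattern is the $regex from above, repeated so the snippet runs on its own.

```php
<?php
// Same pattern as in the answer above, repeated so this snippet is self-contained.
$regex = '/(
    [\xC0-\xC1] # Invalid UTF-8 Bytes
    | [\xF5-\xFF] # Invalid UTF-8 Bytes
    | \xE0[\x80-\x9F] # Overlong encoding of prior code point
    | \xF0[\x80-\x8F] # Overlong encoding of prior code point
    | [\xC2-\xDF](?![\x80-\xBF]) # Invalid UTF-8 Sequence Start
    | [\xE0-\xEF](?![\x80-\xBF]{2}) # Invalid UTF-8 Sequence Start
    | [\xF0-\xF4](?![\x80-\xBF]{3}) # Invalid UTF-8 Sequence Start
    | (?<=[\x00-\x7F\xF5-\xFF])[\x80-\xBF] # Invalid UTF-8 Sequence Middle
    | (?<![\xC2-\xDF]|[\xE0-\xEF]|[\xE0-\xEF][\x80-\xBF]|[\xF0-\xF4]|[\xF0-\xF4][\x80-\xBF]|[\xF0-\xF4][\x80-\xBF]{2})[\x80-\xBF] # Overlong Sequence
    | (?<=[\xE0-\xEF])[\x80-\xBF](?![\x80-\xBF]) # Short 3 byte sequence
    | (?<=[\xF0-\xF4])[\x80-\xBF](?![\x80-\xBF]{2}) # Short 4 byte sequence
    | (?<=[\xF0-\xF4][\x80-\xBF])[\x80-\xBF](?![\x80-\xBF]) # Short 4 byte sequence (2)
)/x';

// Hypothetical helper: rewrite every byte matched by $regex as "<N>",
// where N is the decimal value of that byte.
function escape_invalid_utf8($text, $regex)
{
    return preg_replace_callback($regex, function ($m) {
        $out = '';
        // A single match can span more than one byte (the overlong alternatives).
        foreach (str_split($m[0]) as $byte) {
            $out .= '<' . ord($byte) . '>';
        }
        return $out;
    }, $text);
}

echo escape_invalid_utf8("foo" . chr(128) . chr(255), $regex), "\n"; // foo<128><255>
```

Returning "\xEF\xBF\xBD" from the callback instead would give the U+FFFD replacement suggested in the comments. By my reading of the pattern, not every byte of an ill-formed sequence is necessarily matched (the last byte of E0 80 80, for example, seems to be shielded by the negative lookbehind), so the result is worth re-validating if it has to be well-formed UTF-8.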

@hakre: except that it depends on compile time options (PCRE_UTF8). So it's not portable... — ircmaxell, Jul 29 '12 at 13:03
@Jack: It's an extension, you can compile PHP without the PCRE extension. https://github.com/php/php-src/tree/PHP-5.4/ext/pcre and `--without-pcre-regex` switch — hakre, Nov 01 '12 at 13:13
It's perhaps worth changing the suggestion at the end: rather than removing invalid sequences, replace them with U+FFFD `"\xEF\xBF\xBD"`, see http://www.unicode.org/reports/tr36/#Ill-Formed_Subsequences — hakre, Aug 06 '15 at 20:25

score 11 · Answer 2 · edited Jul 30 '21 at 23:00

Assuming PHP is compiled with PCRE, UTF-8 support is most often enabled as well. So, as explicitly asked for in the question, this very simple regular expression can detect invalid UTF-8 strings, because they won't match:

preg_match('//u', $string);
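
As a minimal sketch of how that check might be wrapped (the function name is_valid_utf8() is hypothetical, and PCRE is assumed to be built with UTF-8 support): with the u modifier, preg_match() returns false rather than 0 when the subject is not valid UTF-8, and preg_last_error() reports PREG_BAD_UTF8_ERROR.

```php
<?php
// Hypothetical wrapper around the '//u' check.
function is_valid_utf8($string)
{
    // The empty pattern matches any valid subject, so 1 means "valid UTF-8".
    // On an invalid subject preg_match() returns false, not 0.
    return preg_match('//u', $string) === 1;
}

var_dump(is_valid_utf8("foo"));                       // bool(true)
var_dump(is_valid_utf8("foo" . chr(128) . chr(255))); // bool(false)

// The last call failed because of bad UTF-8, not for some other reason.
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR);  // bool(true)
```

This only reports whether the whole string is valid; it does not say which bytes are the problem.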

You can argue that the u modifier (PCRE_UTF8) is not always available, and true, this can happen, as this question shows:

What is the preg_match_all u flag dependent on?

However, in my practical developer life, this has never been an issue. It is more of a problem when the PCRE extension is not available at all, which would render any answer relying on PCRE useless (even mine here). But that has mostly been a problem of the past for some years now.

A lengthier answer similar to this one has been given in the somewhat duplicate question:

How can I detect a malformed UTF-8 string in PHP?

So I think this question should highlight more of the benefits the suggested answer ships with.

Is it perhaps the PHP apache module and apache is not compiled with PCRE UTF-8 support? — hakre, Jan 26 '14 at 10:09

score 5 · Answer 3 · edited Mar 16 '23 at 07:52

The W3C has a page (titled Multilingual form encoding) that lists the following Perl regular expression which matches a valid UTF-8 string.

(Note that this is the opposite of the regex listed in another answer to this SO question which matches an invalid UTF-8 string.)

#  Returns true if $field is UTF-8, and false otherwise.

$field =~
  m/\A(
     [\x00-\x7F]                        # ASCII
   | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
   |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
   | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
   |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
   |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
   | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
   |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
  )*\z/x;
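
Since the question asks which bytes are invalid, the same alternatives can be used in PHP to locate the first ill-formed byte. This is a sketch under assumptions, not something from the W3C page: the helper name first_invalid_utf8_offset() is hypothetical, the trailing \z is dropped so the match is the longest valid prefix, and a non-capturing group is used. On very long inputs preg_match() may fail because of PCRE limits, which the sketch reports as an exception.

```php
<?php
// Hypothetical helper: byte offset of the first ill-formed byte, or null if
// the whole string is valid UTF-8 according to the W3C alternatives.
function first_invalid_utf8_offset($string)
{
    // No \z here: the match is the longest valid prefix of $string.
    $validPrefix = '/\A(?:
         [\x00-\x7F]                        # ASCII
       | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
       |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
       | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
       |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
       |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
       | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
       |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
    )*/x';

    if (preg_match($validPrefix, $string, $m) !== 1) {
        // For example, a PCRE backtrack or stack limit was hit.
        throw new RuntimeException('preg_match() failed: ' . preg_last_error());
    }

    $validLength = strlen($m[0]);
    return $validLength === strlen($string) ? null : $validLength;
}

var_dump(first_invalid_utf8_offset("h\xC3\xA9llo"));              // NULL
var_dump(first_invalid_utf8_offset("foo" . chr(128) . chr(255))); // int(3)
```

Calling the helper again on the remainder of the string (after skipping the bad byte) would find the next offset, at the cost of one preg_match() call per invalid byte.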

answered Dec 29 '13 at 17:09 by Todd Ditchendorf · edited Mar 16 '23 at 07:52 by zgpmax

This regex doesn't match valid ASCII (control chars): `[\x09\x0A\x0D\x20-\x7E]` should be `[\x00-\x7F]` – Brad Kent, May 23 '17 at 01:45
@BradKent Indeed, the W3C page actually says `[\x00-\x7F]` and only shows the more restricted set as a note at the end of the prose. I have edited the answer – zgpmax, Mar 16 '23 at 07:54

score 0 · Answer 4 · edited Jul 30 '21 at 23:11

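This works for me for detecting characters outside the Latin-1 range (code points above U+00FF), like emoji, Russian, or Chinese:

// Class method. Returns true if $string contains a character above U+00FF.
// The pattern is not anchored, so it also works on multi-line strings.
private function has_unicode($string)
{
    $pattern = '/[^\x{00}-\x{FF}]/u';
    return preg_match($pattern, $string) === 1;
}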
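
To make the scope of this check concrete, here is a small standalone sketch. The function name has_non_latin1() is hypothetical and the pattern is a simplified, unanchored variant of the one above. It shows that the check is about code points above U+00FF, and that invalid UTF-8 input simply yields false because preg_match() fails under the u modifier.

```php
<?php
// Hypothetical standalone variant: true if any character is above U+00FF.
function has_non_latin1($string)
{
    // preg_match() returns false on invalid UTF-8, so "=== 1" maps that to false.
    return preg_match('/[^\x{00}-\x{FF}]/u', $string) === 1;
}

var_dump(has_non_latin1("\xC3\xA9"));       // bool(false): U+00E9 is within Latin-1
var_dump(has_non_latin1("\xD0\xB4"));       // bool(true):  U+0434 (Cyrillic)
var_dump(has_non_latin1("foo" . chr(128))); // bool(false): invalid UTF-8 input
```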

answered Aug 04 '20 at 21:46 by togobites · edited Jul 30 '21 at 23:11 by Peter Mortensen
