Ensuring valid UTF-8 in PHP

Question

I'm using PHP to handle text from a variety of sources. I don't anticipate it will be anything other than UTF-8, ISO 8859-1, or perhaps Windows-1252. If it's anything other than one of those, I just need to make sure the text gets turned into a valid UTF-8 string, even if characters are lost. Does the //TRANSLIT option of iconv solve this?

For example, would this code ensure that a string is safe to insert into a UTF-8 encoded document (or database)?

function make_safe_for_utf8_use($string) {

    $encoding = mb_detect_encoding($string, "UTF-8,ISO-8859-1,WINDOWS-1252");

    if ($encoding != 'UTF-8') {
        return iconv($encoding, 'UTF-8//TRANSLIT', $string);
    }
    else {
        return $string;
    }
}

score 39 · Accepted Answer · edited Dec 11 '11 at 11:54

UTF-8 can store any Unicode character. If your encoding is anything else at all, including ISO-8859-1 or Windows-1252, UTF-8 can store every character in it. So you don't have to worry about losing any characters when you convert a string from any other encoding to UTF-8.

Further, both ISO-8859-1 and Windows-1252 are single-byte encodings where any byte is valid. It is not technically possible to distinguish between them. I would choose Windows-1252 as your default match for non-UTF-8 sequences, as the only bytes that decode differently are those in the range 0x80-0x9F. These decode to various characters like smart quotes and the Euro sign in Windows-1252, whereas in ISO-8859-1 they are invisible control characters which are almost never used. Web browsers may sometimes say they are using ISO-8859-1, but often they will really be using Windows-1252.
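
To make the 0x80-0x9F difference concrete, here is a small sketch that decodes the same byte under both encodings. It assumes the mbstring extension is available; the variable names are only for illustration.

```php
<?php
// Byte 0x93: a left curly quote in Windows-1252, a C1 control character in ISO-8859-1.
$byte = "\x93";

$asCp1252 = mb_convert_encoding($byte, 'UTF-8', 'Windows-1252');
$asLatin1 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1');

echo bin2hex($asCp1252), "\n"; // e2809c (U+201C LEFT DOUBLE QUOTATION MARK)
echo bin2hex($asLatin1), "\n"; // c293   (U+0093, an invisible control character)
```

Both results are valid UTF-8; only the Windows-1252 reading gives the character the author most likely meant.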

> would this code ensure that a string is safe to insert into a UTF-8 encoded document

You would certainly want to set the optional ‘strict’ parameter of mb_detect_encoding to TRUE for this purpose. But I'm not sure this actually covers all invalid UTF-8 sequences: the function does not claim to check a byte sequence for UTF-8 validity explicitly. There have been known cases where mb_detect_encoding guessed UTF-8 incorrectly, though I don't know if that can still happen in strict mode.

If you want to be sure, do it yourself using the W3C-recommended regex:

if (preg_match('%^(?:
      [\x09\x0A\x0D\x20-\x7E]            # ASCII
    | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
    | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
    | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
    | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
    | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
    | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
)*$%xs', $string))
    return $string;
else
    return iconv('CP1252', 'UTF-8', $string);
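
The snippet above is meant to sit inside a function. As a rough sketch of the same validate-then-fall-back idea, the version below swaps the regex for mb_check_encoding() and iconv() for mb_convert_encoding(), so it assumes the mbstring extension. The name force_utf8 is hypothetical and not part of the original answer.

```php
<?php
/**
 * Return $string unchanged if it is already valid UTF-8; otherwise
 * reinterpret its bytes as Windows-1252 and convert them to UTF-8.
 * (force_utf8 is a hypothetical name used only for this sketch.)
 */
function force_utf8(string $string): string
{
    if (mb_check_encoding($string, 'UTF-8')) {
        return $string;
    }
    return mb_convert_encoding($string, 'UTF-8', 'Windows-1252');
}

echo force_utf8("caf\xC3\xA9"), "\n"; // already UTF-8, returned as-is
echo force_utf8("caf\xE9"), "\n";     // Windows-1252 bytes, converted to UTF-8
```

Text that was in neither encoding still comes out as valid UTF-8, but its non-ASCII characters will be the wrong ones.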

edited Dec 11 '11 at 11:54

Haim Evgi

answered Oct 06 '09 at 04:16

bobince

Thanks very much. I know developers always comment on the slowness of regexes - how careful should I be using this in big loops with lots of text? For example, a loop that iterates 200 times and cleanses text of 10,000 characters on each iteration. – Brian Oct 06 '09 at 15:56
Whilst I'm not a fan of regex, in this case it shouldn't be that bad. Regex gets slow when you have successive or nested `?`/`*`/`+` sequences that can cause it to have to backtrack looking for different ways to match. That won't happen in this case. – bobince Oct 06 '09 at 16:09
Excellent. So when using iconv as you describe above, if I specify CP1252 as the input charset, and the string is something other than CP1252 or ISO-8859-1, it will return a UTF-8 safe string, although some characters may be lost. Is that correct? – Brian Oct 06 '09 at 16:25
It will return a UTF-8-safe string, yes. Non-ASCII characters will come as the wrong characters, but not dangerous ones. – bobince Oct 06 '09 at 18:36
Actually, this regex is wrong. It will fail to match valid UTF-8 code points (such as `chr(0)`). It's fine for printable characters, but not generic UTF-8... – ircmaxell Jul 29 '12 at 13:02
It mightn't match all valid UTF-8 encodings but it will match against UTF-8 encodings that are valid in XML. – CJ Dennis Feb 27 '13 at 05:10
Note that this answer will cause issues in many situations because the complex regex can cause PCRE to crash: https://bugs.php.net/bug.php?id=36463 . It's correct, but it sometimes doesn't work. It didn't work for me; use `ini_set('mbstring.substitute_character', "none"); $utf8_string = mb_convert_encoding($string, 'UTF-8', 'UTF-8');` instead. – redreinard Mar 14 '16 at 21:13
@redreinard: wow, that's... surprising. Although the expression looks tricky it is in fact very simple from a regex point of view—there are no advanced features and no possibility of backtracking; no recursion should be needed. There's a comment on that bug saying even `^(a)+$` fails for 203-byte input... surely this can't be expected/acceptable behaviour? It seems to work fine in R (which also uses PCRE), for what it's worth. I think Rasmus is ignoring a real problem. :-( – bobince Mar 15 '16 at 08:30
Also experiencing the issue with this, seems to fail on anything moderately sizable, like the HTML of a modern web page – Brian Leishman Oct 04 '17 at 15:42

score 19 · Answer 2 · edited Nov 10 '22 at 13:51

With the mbstring library, you have mb_check_encoding().

Example of use:

mb_check_encoding($string, 'UTF-8');

However, with PHP 7.1.9 on a recent Windows 10 system, the regex solution (from bobince's answer) now outperforms mb_check_encoding() for any string length (tested on 20,000 iterations):

10 characters: regex => 4 ms, mb_check_encoding() => 64 ms
10,000 characters: regex => 125 ms, mb_check_encoding() => 2.4 s
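
For reference, mb_check_encoding() returns a boolean. The sketch below runs it over a few hand-picked byte strings (the sample set is my own, chosen to mirror the overlong and surrogate cases named in the regex comments). On recent PHP versions I would expect the first two to be reported as valid and the rest as invalid, though older mbstring releases may differ.

```php
<?php
// Each entry maps a description to a byte string to be checked.
$samples = [
    'plain ASCII'      => "hello",
    'valid 2-byte (é)' => "\xC3\xA9",
    'truncated 3-byte' => "\xE2\x82",
    'overlong "/"'     => "\xC0\xAF",
    'UTF-16 surrogate' => "\xED\xA0\x80",
];

foreach ($samples as $label => $bytes) {
    $ok = mb_check_encoding($bytes, 'UTF-8');
    echo $label, ': ', ($ok ? 'valid' : 'invalid'), "\n";
}
```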

edited Nov 10 '22 at 13:51

answered Nov 21 '11 at 17:33

Maxime Pacary

Your system must be screaming fast, because I get ~5 seconds on 7500 iterations on a pretty modern system (although I am dealing with some pretty large strings, think the HTML of a fairly modern website). – Brian Leishman Oct 04 '17 at 15:50
What is "the regex solution"? – AndreKR Nov 10 '22 at 03:15
bobince's solution – Maxime Pacary Nov 10 '22 at 13:50

score 7 · Answer 3 · answered Jun 11 '13 at 10:45

Just a note: Instead of using the often recommended (rather complex) regular expression by W3C, you can simply use the 'u' modifier to test a string for UTF-8 validity:

<?php
  if (preg_match("//u", $string)) {
      // $string is valid UTF-8
  }
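
A slightly fuller sketch of the same check, wrapped in a helper (is_valid_utf8 is a hypothetical name). It relies on preg_match() returning false rather than 0 when the subject is not valid UTF-8 under the u modifier.

```php
<?php
// Hypothetical helper wrapping the check above.
function is_valid_utf8(string $string): bool
{
    // With the u modifier, preg_match() returns false (not 0) when the
    // subject is not valid UTF-8; the empty pattern matches any valid subject.
    return preg_match('//u', $string) === 1;
}

var_dump(is_valid_utf8("caf\xC3\xA9")); // bool(true)
var_dump(is_valid_utf8("caf\xE9"));     // bool(false)
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR); // bool(true)
```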

answered Jun 11 '13 at 10:45

eyecatchUp

also back in the days: [How to detect if have to apply utf8 decode or encode on a string?](http://stackoverflow.com/a/4407996/367456) – hakre Oct 10 '14 at 13:53
Easy common-case check, but not completely reliable. Its behavior depends on the PHP version, but more importantly, it allows invalid multibyte sequences. http://www.phpwact.org/php/i18n/charsets#checking_utf-8_for_well_formedness – Stephen M. Harris Dec 12 '14 at 16:23

score 1 · Answer 4 · edited Jul 08 '19 at 13:47

In response to the claim that "iconv is idempotent":

iconv is not idempotent either.

A big difference between utf8_encode() and iconv() is that iconv may raise errors such as "Detected an incomplete multibyte character in input string", even with //IGNORE:

iconv('ISO-8859-1', 'UTF-8'.'//IGNORE', $str)

Regarding this line from the question's code:

$encoding = mb_detect_encoding($string, "UTF-8,ISO-8859-1,WINDOWS-1252");

Be aware of how mb_detect_encoding behaves: it can report UTF-8 even for invalid (badly formed) UTF-8 strings.

score 0 · Answer 5 · edited Jul 08 '19 at 13:45

Have a look at http://www.phpwact.org/php/i18n/charsets for a guide about character sets. This page links to a page specifically for UTF-8.

edited Jul 08 '19 at 13:45

Peter Mortensen

answered Oct 06 '09 at 06:19

Martijn

The link seems to be broken. – Peter Mortensen Jul 08 '19 at 13:45
