26

I'm using PHP to handle text from a variety of sources. I don't anticipate it will be anything other than UTF-8, ISO 8859-1, or perhaps Windows-1252. If it's anything other than one of those, I just need to make sure the text gets turned into a valid UTF-8 string, even if characters are lost. Does the //TRANSLIT option of iconv solve this?

For example, would this code ensure that a string is safe to insert into a UTF-8 encoded document (or database)?

function make_safe_for_utf8_use($string) {

    $encoding = mb_detect_encoding($string, "UTF-8,ISO-8859-1,WINDOWS-1252");

    if ($encoding != 'UTF-8') {
        return iconv($encoding, 'UTF-8//TRANSLIT', $string);
    }
    else {
        return $string;
    }
}
Peter Mortensen
Brian

5 Answers

39

UTF-8 can store any Unicode character. If your encoding is anything else at all, including ISO-8859-1 or Windows-1252, UTF-8 can store every character in it. So you don't have to worry about losing any characters when you convert a string from any other encoding to UTF-8.

Further, both ISO-8859-1 and Windows-1252 are single-byte encodings where any byte is valid. It is not technically possible to distinguish between them. I would choose Windows-1252 as your default match for non-UTF-8 sequences, as the only bytes that decode differently are the range 0x80-0x9F. These decode to various characters like smart quotes and the Euro in Windows-1252, whereas in ISO-8859-1 they are invisible control characters which are almost never used. Web browsers may sometimes say they are using ISO-8859-1, but often they will really be using Windows-1252.
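To make the 0x80-0x9F difference concrete, here is a small sketch that decodes the same byte both ways. It assumes the mbstring extension is loaded; the variable names are only illustrative.

```php
<?php
// Sketch: one byte from the 0x80-0x9F range, decoded under both encodings.
// Assumes the mbstring extension is available.
$byte = "\x93"; // left double quotation mark in Windows-1252

$asCp1252 = mb_convert_encoding($byte, 'UTF-8', 'Windows-1252');
$asLatin1 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1');

echo bin2hex($asCp1252), "\n"; // e2809c: U+201C, a visible curly quote
echo bin2hex($asLatin1), "\n"; // c293: U+0093, an invisible C1 control character
```

Both results are valid UTF-8; only the Windows-1252 interpretation yields the character the author most likely meant.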

would this code ensure that a string is safe to insert into a UTF-8 encoded document

You would certainly want to set the optional ‘strict’ parameter of mb_detect_encoding to TRUE for this purpose. But I'm not sure this actually covers all invalid UTF-8 sequences. The function does not claim to check a byte sequence for UTF-8 validity explicitly. There have been known cases where mb_detect_encoding would guess UTF-8 incorrectly before, though I don't know if that can still happen in strict mode.
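For reference, a minimal sketch of what passing the strict flag looks like. It assumes mbstring is loaded and uses made-up sample strings. It only illustrates the call; it does not show that strict mode catches every malformed sequence, and detection results can vary between PHP versions.

```php
<?php
// Sketch: strict detection against a candidate list (assumes mbstring is loaded).
$candidates = ['UTF-8', 'ISO-8859-1'];

$validUtf8 = "caf\xC3\xA9"; // "café" encoded as UTF-8
$latin1    = "caf\xE9";     // "café" encoded as ISO-8859-1; not valid UTF-8

// The third argument enables strict mode.
$first  = mb_detect_encoding($validUtf8, $candidates, true);
$second = mb_detect_encoding($latin1, $candidates, true);

var_dump($first, $second);
```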

If you want to be sure, do it yourself using the W3-recommended regex:

if (preg_match('%^(?:
      [\x09\x0A\x0D\x20-\x7E]            # ASCII
    | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
    | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
    | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
    | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
    | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
    | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
)*$%xs', $string))
    return $string;
else
    return iconv('CP1252', 'UTF-8', $string);
Haim Evgi
bobince
  • Thanks very much. I know developers always comment on the slowness of regexes - how careful should I be using this in big loops with lots of text? For example, a loop that iterates 200 times and cleanses text of 10,000 characters on each iteration. – Brian Oct 06 '09 at 15:56
  • Whilst I'm not a fan of regex, in this case it shouldn't be that bad. Regex gets slow when you have successive or nested `?`/`*`/`+` sequences that can cause it to have to backtrack looking for different ways to match. That won't happen in this case. – bobince Oct 06 '09 at 16:09
  • Excellent. So when using iconv as you describe above, if I specify CP1252 as the input charset, and the string is something other than CP1252 or ISO-8859-1, it will return a UTF-8 safe string, although some characters may be lost. Is that correct? – Brian Oct 06 '09 at 16:25
  • 1
    It will return a UTF-8-safe string, yes. Non-ASCII characters will come as the wrong characters, but not dangerous ones. – bobince Oct 06 '09 at 18:36
  • 2
    Actually, this regex is wrong. It will fail to match valid UTF-8 code points (such as `chr(0)`). It's fine for printable characters, but not generic UTF-8... – ircmaxell Jul 29 '12 at 13:02
  • It mightn't match all valid UTF-8 encodings but it will match against UTF-8 encodings that are valid in XML. – CJ Dennis Feb 27 '13 at 05:10
  • Note that this answer will cause issues in many situations because the complex regex can cause PCRE to crash: https://bugs.php.net/bug.php?id=36463 . It's correct, but it sometimes doesn't work. It didn't work for me; instead use ini_set('mbstring.substitute_character', "none"); $utf8_string = mb_convert_encoding($string, 'UTF-8', 'UTF-8'); – redreinard Mar 14 '16 at 21:13
  • @redreinard: wow, that's... surprising. Although the expression looks tricky it is in fact very simple from a regex point of view—there are no advanced features and no possibility of backtracking; no recursion should be needed. There's a comment on that bug saying even `^(a)+$` fails for 203-byte input... surely this can't be expected/acceptable behaviour? It seems to work fine in R (which also uses PCRE), for what it's worth. I think Rasmus is ignoring a real problem. :-( – bobince Mar 15 '16 at 08:30
  • Also experiencing the issue with this, seems to fail on anything moderately sizable, like the HTML of a modern web page – Brian Leishman Oct 04 '17 at 15:42
19

With the mbstring library, you have mb_check_encoding().

Example of use:

mb_check_encoding($string, 'UTF-8');

However, with PHP 7.1.9 on a recent Windows 10 system, the regex solution now outperforms mb_check_encoding() for any string length (tested on 20,000 iterations):

  • 10 characters: regex => 4 ms, mb_check_encoding() => 64 ms
  • 10,000 characters: regex => 125 ms, mb_check_encoding() => 2.4 s
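A minimal sketch of how mb_check_encoding() might be combined with a Windows-1252 fallback. The helper name `to_utf8_assuming_cp1252` is hypothetical, and the sketch assumes PHP 7 or later with mbstring loaded, and that any input which is not valid UTF-8 is Windows-1252.

```php
<?php
// Hypothetical helper: validate with mb_check_encoding() and fall back to
// treating the input as Windows-1252.
function to_utf8_assuming_cp1252(string $string): string
{
    if (mb_check_encoding($string, 'UTF-8')) {
        return $string; // already well-formed UTF-8
    }
    // Assumption: anything that is not valid UTF-8 is Windows-1252.
    return mb_convert_encoding($string, 'UTF-8', 'Windows-1252');
}

echo bin2hex(to_utf8_assuming_cp1252("caf\xE9")), "\n"; // 636166c3a9
```

Input that is really in some third encoding still comes back as valid UTF-8, but its non-ASCII characters will be the wrong ones.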
Maxime Pacary
7

Just a note: Instead of using the often recommended (rather complex) regular expression by W3C, you can simply use the 'u' modifier to test a string for UTF-8 validity:

<?php
  if (preg_match("//u", $string)) {
      // $string is valid UTF-8
  }
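The same check can be wrapped in a small function. This is a sketch only: the name `is_utf8_by_pcre` is hypothetical, and it relies on preg_match() returning false rather than 0 when the subject is not valid UTF-8 under the 'u' modifier.

```php
<?php
// Hypothetical wrapper around the 'u' modifier check.
function is_utf8_by_pcre(string $string): bool
{
    // With /u, preg_match() returns false (not 0) for a malformed subject,
    // so compare strictly against 1.
    return preg_match('//u', $string) === 1;
}

var_dump(is_utf8_by_pcre("caf\xC3\xA9")); // bool(true)
var_dump(is_utf8_by_pcre("caf\xE9"));     // bool(false)
```

The strict comparison against 1 avoids treating the error result as "no match".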
eyecatchUp
  • also back in the days: [How to detect if have to apply utf8 decode or encode on a string?](http://stackoverflow.com/a/4407996/367456) – hakre Oct 10 '14 at 13:53
  • 1
    Easy common-case check, but not completely reliable. Its behavior depends on PHP version, but more importantly, it allows invalid multibyte sequences. http://www.phpwact.org/php/i18n/charsets#checking_utf-8_for_well_formedness – Stephen M. Harris Dec 12 '14 at 16:23
1

Answer to "iconv is idempotent":

iconv is not idempotent either.

A big difference between utf8_encode() and iconv() is that iconv() may raise errors such as "Detected an incomplete multibyte character in input string", even with //IGNORE:

iconv('ISO-8859-1', 'UTF-8'.'//IGNORE', $str)

As for this line in the question's code:

$encoding = mb_detect_encoding($string, "UTF-8,ISO-8859-1,WINDOWS-1252");

Be careful with mb_detect_encoding: it can report UTF-8 even for invalid (badly formed) UTF-8 strings.

Peter Mortensen
Nadir
0

Have a look at http://www.phpwact.org/php/i18n/charsets for a guide about character sets. This page links to a page specifically for UTF-8.

Peter Mortensen
Martijn