40

I'm looking for a general strategy/advice on how to handle invalid UTF-8 input from users.

Even though my web application uses UTF-8, some users somehow submit invalid characters. This causes errors in PHP's json_encode(), and in general it seems like a bad idea to keep such data around.

W3C I18N FAQ: Multilingual Forms says "If non-UTF-8 data is received, an error message should be sent back.".

  • How exactly should this be practically done, throughout a site with dozens of different places where data can be input?
  • How do you present the error in a helpful way to the user?
  • How do you temporarily store and display bad form data so the user doesn't lose all their text? Strip bad characters? Use a replacement character, and how?
  • For existing data in the database, when invalid UTF-8 data is detected, should I try to convert it and save it back (how? utf8_encode()? mb_convert_encoding()?), or leave it as-is in the database but do something (what?) before json_encode()?

I'm very familiar with the mbstring extension and am not asking "how does UTF-8 work in PHP?". I'd like advice from people with real-world experience on how they've handled this.

As part of the solution, I'd really like to see a fast method to convert invalid characters to U+FFFD.

Peter Mortensen
philfreo
  • It doesn't follow the guidelines you linked, but I just replace invalid byte sequences with [U+FFFD](http://www.fileformat.info/info/unicode/char/fffd/index.htm) so I can be done with it. – zildjohn01 Sep 15 '10 at 15:13
  • @zildjohn01, What's the best way to do this (which PHP functions?). Could you leave a detailed answer with your approach? – philfreo Sep 18 '10 at 18:39
  • To be honest, it's not very exciting. I just translated a UTF-8 parser from C to PHP. It scans the string byte by byte, and if an invalid byte sequence is found, it rewrites the string manually. Slow, but portable. – zildjohn01 Sep 18 '10 at 21:03
  • Still would be interested in seeing it if you care to share – philfreo Sep 19 '10 at 04:34
  • I'd really like to see a *fast* method for translating invalid characters to U+FFFD. :) – philfreo Sep 21 '10 at 18:38
  • Unfortunately this method isn't fast (relatively speaking). More unfortunately, I don't have the go-ahead to post it. Why not just start with something like, oh say [this](http://tidy.sourceforge.net/cgi-bin/lxr/source/src/utf8.c), and instead of returning an error on an invalid char, start rewriting the string from the point it fails? (Sorry I can't help) – zildjohn01 Sep 21 '10 at 20:07

9 Answers

62

The accept-charset="UTF-8" attribute is only a guideline for browsers to follow; they are not forced to submit data that way. Crappy form submission bots are a good example...

I usually ignore bad characters, either via iconv() or with the less reliable utf8_encode() / utf8_decode() functions. If you use iconv, you also have the option to transliterate bad characters.

Here is an example using iconv():

$str_ignore = iconv('UTF-8', 'UTF-8//IGNORE', $str);
$str_translit = iconv('UTF-8', 'UTF-8//TRANSLIT', $str);

If you want to display an error message to your users, I'd probably do this in a global way instead of once per received value. Something like this would probably do just fine:

function utf8_clean($str)
{
    return iconv('UTF-8', 'UTF-8//IGNORE', $str);
}

$clean_GET = array_map('utf8_clean', $_GET);

if (serialize($_GET) != serialize($clean_GET))
{
    $_GET = $clean_GET;
    $error_msg = 'Your data is not valid UTF-8 and has been stripped.';
}

// $_GET is clean!
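Note that array_map() only handles flat arrays, while request data can be nested (e.g. fields named tags[]). A minimal sketch of a recursive check follows, assuming the mbstring extension is loaded. The helper name find_invalid_utf8() is made up for illustration, and it only inspects values, not array keys:

```php
<?php
// Hypothetical helper: returns the "path" of every string value that is
// not valid UTF-8, descending into nested arrays.
function find_invalid_utf8(array $input, string $prefix = ''): array
{
    $invalid = [];

    foreach ($input as $key => $value) {
        // Build a readable path such as "tags[1]" for nested values
        $path = $prefix === '' ? (string) $key : $prefix . '[' . $key . ']';

        if (is_array($value)) {
            $invalid = array_merge($invalid, find_invalid_utf8($value, $path));
        } elseif (is_string($value) && !mb_check_encoding($value, 'UTF-8')) {
            $invalid[] = $path;
        }
    }

    return $invalid;
}
```

Calling it once on $_GET and $_POST in a front controller would give you the list of offending field names for a single, site-wide error message.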

You may also want to normalize new lines and strip (non-)visible control chars, like this:

function Clean($string, $control = true)
{
    $string = iconv('UTF-8', 'UTF-8//IGNORE', $string);

    if ($control === true)
    {
        return preg_replace('~\p{C}+~u', '', $string);
    }

    return preg_replace(array('~\r\n?~', '~[^\P{C}\t\n]+~u'), array("\n", ''), $string);
}

Code to convert from UTF-8 to Unicode code points:

function Codepoint($char)
{
    $result = null;
    $codepoint = unpack('N', iconv('UTF-8', 'UCS-4BE', $char));

    if (is_array($codepoint) && array_key_exists(1, $codepoint))
    {
        $result = sprintf('U+%04X', $codepoint[1]);
    }

    return $result;
}

echo Codepoint('à'); // U+00E0
echo Codepoint('ひ'); // U+3072

It is probably faster than any other alternative, though I haven't tested it extensively.


Example:

$string = "\xEF\xBF\xBEhello world\xEF\xBF\xBD"; // U+FFFE, "hello world", U+FFFD

// U+FFFEhello worldU+FFFD
echo preg_replace_callback('/[\p{So}\p{Cf}\p{Co}\p{Cs}\p{Cn}]/u', 'Bad_Codepoint', $string);

function Bad_Codepoint($string)
{
    $result = array();

    foreach ((array) $string as $char)
    {
        $codepoint = unpack('N', iconv('UTF-8', 'UCS-4BE', $char));

        if (is_array($codepoint) && array_key_exists(1, $codepoint))
        {
            $result[] = sprintf('U+%04X', $codepoint[1]);
        }
    }

    return implode('', $result);
}

This may be what you were looking for.
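For actually replacing invalid byte sequences with the U+FFFD character (rather than printing the text "U+FFFD"), here is a minimal sketch that relies on the mbstring extension instead of iconv. The function name replace_invalid_utf8() is hypothetical. How many replacement characters are produced for one truncated multi-byte sequence may differ between PHP versions, so treat this as a starting point rather than a definitive implementation:

```php
<?php
// Sketch: replace invalid UTF-8 byte sequences with U+FFFD via mbstring.
function replace_invalid_utf8(string $str): string
{
    // Remember the current substitute character so it can be restored
    $previous = mb_substitute_character();

    // Use U+FFFD instead of the default "?"
    mb_substitute_character(0xFFFD);

    // Converting UTF-8 to UTF-8 makes mbstring substitute invalid sequences
    $clean = mb_convert_encoding($str, 'UTF-8', 'UTF-8');

    mb_substitute_character($previous);

    return $clean;
}
```

On PHP 7.2 and later, mb_scrub($str, 'UTF-8') performs the same kind of substitution and also honours mb_substitute_character().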

Peter Mortensen
Alix Axel
  • 1
    Would this method allow you to replace invalid characters with U+FFFD rather than just stripping? It seems like that'd be more helpful so the user sees exactly which chars had a problem. – philfreo Sep 20 '10 at 15:53
  • @philfreo: Not that I know of, not with iconv. But you might get away with regular expressions, something like: `preg_replace('/([^\p{L}\p{M}\p{Z}\p{N}\p{P}\p{S}\p{C}])/u', 'convert_to_unicode_notation("\\1"))', string);` - this is just from the top of my sleepy head, better regexes surely exist out there. Bear in mind that this will be considerably slower than the iconv approach though! – Alix Axel Sep 20 '10 at 23:08
  • 2
    @philfreo: Ok, this one is a must read: http://webcollab.sourceforge.net/unicode.html. – Alix Axel Sep 20 '10 at 23:16
  • Good link. I'd really like to see a *fast* method for translating invalid characters to U+FFFD. – philfreo Sep 21 '10 at 18:37
  • @philfreo: I highly doubt anything substantially faster will be available anytime soon. You could run `iconv()` and if the data has changed use the regex I posted above but wouldn't you then need to check if the transliteration of chars is being submitted and then alert the user (again)? – Alix Axel Sep 24 '10 at 11:20
  • How about something like http://us2.php.net/manual/en/function.utf8-encode.php#97533 but that instead of just testing for UTF8, replaces invalid with U+FFFD – philfreo Sep 24 '10 at 16:59
  • @philfreo: That has to be slower than the regex I've posted before. – Alix Axel Sep 24 '10 at 20:02
  • Ok, for the sake of completeness in your answer, can you include: some code, however slow, that converts invalid to `U+FFFD`, as well as a couple details on why iconv is more reliable than `utf8_encode`? – philfreo Sep 24 '10 at 23:52
  • 1
    @philfreo: Just posted some code to output Unicode code points, I suppose you know where to fit that in the whole picture. Regarding your `utf8_encode` question, the manual page says it all: "encodes **an ISO-8859-1 string** to UTF-8", it throws garbage all the time. `iconv` on the other hand is a mature C library not PHP specific, hence more reliable. – Alix Axel Sep 25 '10 at 01:06
  • @philfreo: "I'd really like to see a fast method to convert invalid characters to U+FFFD". I spent nearly an hour on this, you have to be more explicit in what you are trying to do because I'm not following... – Alix Axel Sep 25 '10 at 01:30
  • Check out http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt - "replace any malformed UTF-8 sequence by a replacement character (U+FFFD), which looks a bit like an inverted question mark, or a similar symbol" - http://www.fileformat.info/info/unicode/char/fffd/index.htm – philfreo Sep 25 '10 at 02:49
  • 1
    so when invalid data is found, rather than just stripping it (`//IGNORE`), the user sees which character was invalid. – philfreo Sep 25 '10 at 02:55
  • I just ran your last code snippet and got the literal text "U+FFFD" rather than having it actually replace the invalid byte sequence with the replacement character that is represented by U+FFFD – philfreo Sep 25 '10 at 04:29
  • @philfreo: That is what `iconv('UTF-8', 'UTF-8//TRANSLIT', $str)` is for. – Alix Axel Sep 25 '10 at 15:01
  • Actually, in testing some invalid utf8 data, translit doesn't actually do that for me. Are you sure? (It also causes an error, so I used //IGNORE//TRANSLIT). Translit just seems to be for things like converting €, stripping accents, etc. It doesn't convert invalid to U+FFFD. – philfreo Oct 08 '10 at 19:15
  • @philfreo: Could you share the invalid data? Also I'm pretty sure `//IGNORE//TRANSLIT` will just count as `//IGNORE`. – Alix Axel Oct 08 '10 at 21:32
  • Sure. Just try outputting this ( http://stackoverflow.com/questions/1301402/example-invalid-utf8-string/3886015#3886015 ). With //IGNORE the invalid characters are stripped. TRANSLIT does nothing in this case (but has an error without also using IGNORE). It seems ideal to replace invalid bytes with U+FFFD rather than stripping so the user can see where the problem is when they look at what was entered. If that happened, then the browser would show the U+FFFD as an upside down question mark and it would also be safe to json_encode(). – philfreo Oct 09 '10 at 17:42
  • 1
    @Yzmir: Would you care to expand on that statement? If you could share some example bogus strings that would be awesome, since it always seems to work in the tests I've made. – Alix Axel May 21 '11 at 01:44
  • 1
    @Alix Axel, So today I tried to reproduce the UTF-8 5-6 byte sequences that would still cause SimpleXmlElement to fail and I couldn't. So I redact my previous comment. Thanks for keeping me honest. – Yzmir Ramirez May 25 '11 at 01:44
  • @Yzmir: Thanks for taking the time to try to reproduce the problem, +1. =) – Alix Axel May 25 '11 at 12:48
  • `~\r[\n]?~` Why is the `\n` in a character range? – alex Sep 28 '11 at 11:25
  • @alex: Only for readability, so that I can easily spot that the `?` operator is being used on the newline and not on `n` by itself. – Alix Axel Sep 28 '11 at 15:24
4

Receiving invalid characters from your web application might have to do with the character sets assumed for HTML forms. You can specify which character set to use for forms with the accept-charset attribute:

<form action="..." accept-charset="UTF-8">

You might also want to take a look at similar questions on Stack Overflow for pointers on handling invalid characters, e.g., those in the column to the right. However, I think that signaling an error to the user is better than trying to clean up the invalid characters, which can cause unexpected loss of significant data or unexpected changes to your user's input.
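If you signal an error, you still need to show the submitted text back to the user without losing it. A minimal sketch, assuming the mbstring extension is available: htmlspecialchars() with the ENT_SUBSTITUTE flag outputs U+FFFD for invalid sequences instead of returning an empty string. The helper name escape_for_redisplay(), the field name and the error wording are made up for illustration:

```php
<?php
// Hypothetical helper for re-populating a form field after validation fails.
function escape_for_redisplay(string $value): string
{
    // ENT_SUBSTITUTE replaces invalid code unit sequences with U+FFFD
    // instead of making htmlspecialchars() return an empty string.
    return htmlspecialchars($value, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

$submitted = "caf\xE9 <b>"; // example raw input containing an invalid byte

if (!mb_check_encoding($submitted, 'UTF-8')) {
    echo '<p class="error">Your text contains characters that are not valid UTF-8. '
       . "They are marked with \u{FFFD} below; please correct them.</p>";
}

echo '<textarea name="comment">' . escape_for_redisplay($submitted) . '</textarea>';
```

This keeps everything the user typed and marks the problem spots, while the raw value is never stored.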

Peter Mortensen
Arc
  • It specifies the character sets accepted by the server. I'm not sure whether it is enough to only specify UTF-8 encoding for the page - the browser could display UTF-8 while sending form data in ISO-8859-1 or something else. – Arc Sep 15 '10 at 15:11
  • What does `accept-charset` really do -- is it impossible for a user to submit invalid characters, or only a suggestion? How should I handle bad data if I still receive it server-side? – philfreo Sep 15 '10 at 15:17
  • According to http://stackoverflow.com/questions/3719974/is-there-any-benefit-to-adding-accept-charsetutf-8-to-html-forms-if-the-page this would be unnecessary – philfreo Sep 15 '10 at 19:41
  • I do not use this attribute myself either and have no problems with UTF-8 characters I tested so far. Referring to Pekka's comment to that question, however, the W3C specification really says that *The default value for this attribute is the reserved string "UNKNOWN". User agents **may** interpret this value as the character encoding that was used to transmit the document*, so I'm not really sure how browsers handle that. http://stackoverflow.com/questions/3719974/#comment-3926382 – Arc Sep 15 '10 at 20:15
  • When you encounter bad data, my opinion is that you should notify the user about that and give her the opportunity to revise her input. This way, you avoid confusion and the user could work around this issue. However, it would be interesting to identify the circumstances leading to you receiving invalid data in the first place - is this caused by specific browsers, what headers are sent by client and server, what encoding is set in the browser after the page with the form is loaded etc. – Arc Sep 15 '10 at 20:25
2

I put together a fairly simple class to check if input is in UTF-8 and to run it through utf8_encode() as need be:

class utf8
{

    /**
     * @param array $data
     * @return array
     */
    public static function encode(array $data)
    {
        foreach ($data as $key=>$val) {
            if (is_array($val)) {
                $data[$key] = self::encode($val);
            } else {
                // A falsy result (0 or false) means the value is not valid UTF-8
                if (!self::check($val)) {
                    $data[$key] = utf8_encode($val);
                }
            }
        }

        return $data;
    }

    /**
     * Regular expression to test a string is UTF8 encoded
     * 
     * RFC3629
     * 
     * @param string $string The string to be tested
     * @return bool
     * 
     * @link http://www.w3.org/International/questions/qa-forms-utf-8.en.php
     */
    public static function check($string)
    {
        return 1 === preg_match('%^(?:
            [\x09\x0A\x0D\x20-\x7E]              # ASCII
            | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
            |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
            | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
            |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
            |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
            | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
            |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
            )*$%xs',
            $string);
    }
}

// For example
$data = utf8::encode($_POST);
Nev Stokes
1

There is a multibyte extension for PHP. See Multibyte String.

You should try the mb_check_encoding() function.
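A short sketch of how that check could sit in front of json_encode(), which returns false for invalid UTF-8 by default. The wrapper name safe_json_encode() is hypothetical; the JSON_INVALID_UTF8_SUBSTITUTE flag requires PHP 7.2 or later and makes json_encode() emit U+FFFD for invalid bytes instead of failing:

```php
<?php
// mb_check_encoding() tells you whether a string is valid in a given encoding.
$valid   = mb_check_encoding('plain ASCII', 'UTF-8'); // true
$invalid = mb_check_encoding("bad byte: \xFF", 'UTF-8'); // false

// Hypothetical wrapper: never fail on invalid UTF-8, substitute U+FFFD instead.
function safe_json_encode($data): string
{
    $json = json_encode($data, JSON_INVALID_UTF8_SUBSTITUTE);

    if ($json === false) {
        // Some other encoding problem (e.g. unsupported type or depth)
        throw new RuntimeException(json_last_error_msg());
    }

    return $json;
}
```

Validating on input and substituting only at output time means the stored data is left untouched while the JSON response still goes out.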

Peter Mortensen
Otar
  • I'm very familiar with the mb extension, as I linked to it in my own question. Comments on this page indicate that this mb_check_encoding() does not really check for bad byte sequences, plus I'm really asking about a general strategy, not how to do one specific part. – philfreo Sep 15 '10 at 14:49
  • What comment is that? Nobody mentions that function that I can see. The purpose of the function is exactly to check for bad byte sequences. There is [one open bug for the function](https://bugs.php.net/bug.php?id=47990), but a comment on that page says it should be closed. – itpastorn Jul 11 '13 at 14:54
1

I recommend merely not allowing garbage to get in. Don't rely on custom functions, which can bog your system down.

Simply walk the submitted data against an alphabet you design. Create an acceptable alphabet string and walk the submitted data, byte by byte, as if it were an array. Push acceptable characters to a new string, and omit unacceptable characters.

The data you store in your database then is data triggered by the user, but not actually user-supplied data.

<?php
    // Build alphabet
    // Optionally, you can remove characters from this array

    $alpha[] = chr(0); // null
    $alpha[] = chr(9); // tab
    $alpha[] = chr(10); // new line
    $alpha[] = chr(11); // vertical tab
    $alpha[] = chr(13); // carriage return

    for ($i = 32; $i <= 126; $i++) {
        $alpha[] = chr($i);
    }

    /* Remove comment to check ASCII ordinals */

    // /*
    // foreach ($alpha as $key => $val) {
    //     print ord($val);
    //     print '<br/>';
    // }
    // print '<hr/>';
    //*/
    //
    // // Test case #1
    //
    // $str = 'afsjdfhasjhdgljhasdlfy42we875y342q8957y2wkjrgSAHKDJgfcv kzXnxbnSXbcv   ' . chr(160) . chr(127) . chr(126);
    //
    // $string = teststr($alpha, $str);
    // print $string;
    // print '<hr/>';
    //
    // // Test case #2
    //
    // $str = '' . '©?™???';
    // $string = teststr($alpha, $str);
    // print $string;
    // print '<hr/>';
    //
    // $str = '©';
    // $string = teststr($alpha, $str);
    // print $string;
    // print '<hr/>';

    $file = 'http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt';
    $testfile = implode(chr(10), file($file));

    $string = teststr($alpha, $testfile);
    print $string;
    print '<hr/>';


    function teststr(&$alpha, &$str) {
        $strlen = strlen($str);
        $newstr = chr(0); // null
        $x = 0;

        if($strlen >= 2) {

            for ($i = 0; $i < $strlen; $i++) {
                $x++;
                if(in_array($str[$i], $alpha)) {
                    // Passed
                    $newstr .= $str[$i];
                }
                else {
                    // Failed
                    print 'Found out of scope character. (ASCII: ' . ord($str[$i]). ')';
                    print '<br/>';
                    $newstr .= '&#65533;';
                }
            }
        }
        elseif($strlen <= 0) {
            // Failed to qualify for test
            print 'Non-existent.';
        }
        elseif($strlen === 1) {
            $x++;
            if(in_array($str, $alpha)) {
                // Passed

                $newstr = $str;
            }
            else {
                // Failed
                print 'Total character failed to qualify.';
                $newstr = '&#65533;';
            }
        }
        else {
            print 'Non-existent (scope).';
        }

        if(mb_detect_encoding($newstr, "UTF-8") == "UTF-8") {
            // Skip
        }
        else {
            $newstr = utf8_encode($newstr);
        }

        // Test encoding:
        if(mb_detect_encoding($newstr, "UTF-8") == "UTF-8") {
            print 'UTF-8 :D<br/>';
        }
        else {
            print 'ENCODED: ' . mb_detect_encoding($newstr, "UTF-8") . '<br/>';
        }

        return $newstr . ' (scope: ' . $x . ', ' . $strlen . ')';
    }
Peter Mortensen
Geekster
  • 1
    How do you propose doing that, when the "alphabet" is any valid UTF-8 character. – philfreo Sep 20 '10 at 14:32
  • Okay EDIT #1 is updated and should purify anything you want to put into JSON. Of course you can adjust the characters in your alphabet if JSON still chokes. If you could post a sample data file that is choking on JSON that'd help me fine-tune this. – Geekster Sep 21 '10 at 17:10
  • It is now UTF-8 returned, proof. – Geekster Sep 22 '10 at 17:27
  • I have updated it to use the file you provided. Your server will need to have fopen wrappers enabled because I'm reading the URL into file(). Of course if you want you can download the file and read it in from your directory but I'm LAZY. :D – Geekster Sep 22 '10 at 17:35
  • Could you make it simply replace invalid characters with U+FFFD, as that document suggests? – philfreo Sep 25 '10 at 03:54
  • @philfreo: Updated, if you don't want any output just comment out the print rows. – Geekster Sep 27 '10 at 13:50
1

For completeness to this question (not necessarily the best answer)...

function as_utf8($s) {
    return mb_convert_encoding($s, "UTF-8", mb_detect_encoding($s));
}
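Because encoding detection can guess wrong, a more conservative variant is sketched below for existing database rows: leave strings that already validate as UTF-8 alone, and convert the rest from one explicitly named legacy encoding. The function name legacy_to_utf8() is hypothetical, and Windows-1252 is only an assumption; substitute whatever encoding your old data was really stored in:

```php
<?php
// Sketch: convert only strings that fail UTF-8 validation, from a known
// legacy encoding rather than a detected one.
function legacy_to_utf8(string $s, string $legacyEncoding = 'Windows-1252'): string
{
    if (mb_check_encoding($s, 'UTF-8')) {
        return $s; // already valid UTF-8, leave untouched
    }

    return mb_convert_encoding($s, 'UTF-8', $legacyEncoding);
}
```

Running this in a one-off migration and saving the result back avoids repeating the conversion before every json_encode() call.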
philfreo
  • 2
    This is good, but be careful; mb_detect_encoding() isn't always 100% accurate if you don't specify which encodings it should check for. Also, some encodings behave almost identically (e.g., ISO-8859-1/Latin-1 and CP-1252/Windows-1252 — in fact, any single-byte encoding such as KOI8-R, **any** flavor of ISO-8859-*, etc. is practically impossible to detect unless you employ some very clever [and likely computationally expensive] heuristics). –  Feb 16 '12 at 16:25
0

Strip all characters outside your given subset. At least in some parts of my application I would not allow characters outside the [a-zA-Z] and [0-9] sets, for example in usernames.

You can build a filter function that silently strips all characters outside this range, or that returns an error if it detects them and pushes the decision to the user.

Peter Mortensen
Elzo Valugi
  • "just ignoring malformed sequences or unavailable characters does not conform to ISO 10646, will make debugging more difficult, and can lead to user confusion." http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt – philfreo Sep 15 '10 at 14:52
  • @philfreo: is what you've linked your homework or just a reference? If it's just a reference, that's because the prof is assigning a homework assignment to students and he is challenging them -- not because there is philosophical relevance to detecting bad encoding. You know the expression "the show must go on"? That applies to programming too and that is why my answer gives you the ability to either strip bad characters or return an error if they are detected. – Geekster Sep 22 '10 at 13:50
0

Try doing what Ruby on Rails does to force all browsers always to post UTF-8 data:

<form accept-charset="UTF-8" action="#{action}" method="post"><div
    style="margin:0;padding:0;display:inline">
    <input name="utf8" type="hidden" value="&#x2713;" />
  </div>
  <!-- form fields -->
</form>

See railssnowman.info or the initial patch for an explanation.

  1. To have the browser send form-submission data in the UTF-8 encoding, just render the page with a Content-Type header of "text/html; charset=utf-8" (or use a meta http-equiv tag).

  2. To have the browser send form-submission data in the UTF-8 encoding, even if the user fiddles with the page encoding (browsers let users do that), use accept-charset="UTF-8" in the form.

  3. To have the browser send form-submission data in the UTF-8 encoding, even if the user fiddles with the page encoding (browsers let users do that), and even if the browser is Internet Explorer and the user switched the page encoding to Korean and entered Korean characters in the form fields, add a hidden input to the form with a value such as &#x2713; which can only be from the Unicode charset (and, in this example, not the Korean charset).
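On the PHP side, the hidden field can also serve as a cheap sanity check on what encoding the form data arrived in. This is only a sketch under that assumption (it is not something the Rails patch itself prescribes), and the function name sentinel_is_utf8() is made up:

```php
<?php
// Hypothetical server-side check for the hidden "utf8" field shown above.
function sentinel_is_utf8(array $post): bool
{
    // U+2713 CHECK MARK is the three bytes E2 9C 93 when encoded as UTF-8.
    return isset($post['utf8']) && $post['utf8'] === "\xE2\x9C\x93";
}
```

If the check fails, the submission was probably not sent as UTF-8, and you can reject it with an error instead of trying to guess its encoding.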

Peter Mortensen
yfeldblum
  • 1
    Does `accept-charset` really force browsers to not send any non-UTF8 data? What happens if they try to? How should I handle it on the server if this client-side validation is bypassed? – philfreo Sep 15 '10 at 15:11
  • Can you explain the hidden field as well - is that necessary? – philfreo Sep 15 '10 at 15:13
  • According to http://stackoverflow.com/questions/3719974/is-there-any-benefit-to-adding-accept-charsetutf-8-to-html-forms-if-the-page this would all be unnecessary – philfreo Sep 15 '10 at 19:42
  • I'm not sure you read that other page correctly.... I edited my answer to include the explanation of what Rails does. – yfeldblum Sep 15 '10 at 22:33
  • This won't help protect against XSS attacks because it's client side. I believe the idea here is to purify the data coming into the system, but you can't rely on HTML flags for that. – Geekster Sep 20 '10 at 13:52
  • If a malicious client throws garbage at the server, it's OK for the server to 400 Bad Request. For well-behaved clients - browsers - use the three tricks above to avoid the server spitting back a 400 Bad Requests because of encoding mismatches. – yfeldblum Sep 20 '10 at 14:04
  • Never rely on clients to have a browser. Think of the bots! And also think of people who use bots legitimately, such as if they do a trackback from their blog, or a pingback. You're not always going to have a browser viewing/submitting to your site. Think also of people with mobile apps that might not have the same constraints as PC browsers. Cleaning of data has to happen server side. You have to assume they are throwing garbage at you. – Geekster Sep 21 '10 at 18:46
  • Don't clean bad data from bots, just error. Cleaning bad data means transforming data in a way that does not preserve the original data just so that your app can pretend it makes sense when it doesn't. You may permit multiple encodings and server-side look at the Content-Type header to determine the charset/encoding used, and do conversions server-side from the known charset/encoding. Bots should not be doing posts, and the scripts that should be doing posts should send data in the correct charset/encoding or any of the charset/encodings that your app supports. – yfeldblum Sep 21 '10 at 19:23
0

Set UTF-8 as the character set in every Content-Type header output by your PHP code:

header('Content-Type: text/html; charset=utf-8');
Peter Mortensen
Bhavin Patel