46

Is there a way to keep json_encode() from returning null for a string that contains an invalid (non-UTF-8) character?

It can be a pain in the ass to debug in a complex system. It would be much more fitting to actually see the invalid character, or at least have it omitted. As it stands, json_encode() will silently drop the entire string.

Example (in UTF-8):

$string = 
  array(utf8_decode("Düsseldorf"), // Deliberately produce broken string
        "Washington",
        "Nairobi"); 

print_r(json_encode($string));

Results in

[null,"Washington","Nairobi"]

Desired result:

["D�sseldorf","Washington","Nairobi"]

Note: I am not looking to make broken strings work in json_encode(). I am looking for ways to make it easier to diagnose encoding errors. A null string isn't helpful for that.

Pekka
  • Is the string `"Düsseldorf"` invalid only when you `utf8_decode()` it? – Matt Ball Jan 11 '11 at 23:22
  • @Matt no, that was just an example to create a broken string for answerers to test – Pekka Jan 11 '11 at 23:25
  • So you’re getting some JSON data that may include invalid UTF-8 strings? – Gumbo Jan 11 '11 at 23:27
  • @Gumbo yup, that might happen. It just took me an hour to find out that a wrongly encoded text file was the problem. I'm looking for a way to recognize the broken encoding at once next time (i.e. `D�sseldorf`) – Pekka Jan 11 '11 at 23:28
  • @Pekka: Well, you could use regular expressions to validate it first. – Gumbo Jan 11 '11 at 23:31
  • 1
    I just write a wrapper for my json decoder that checks the string first using mb_detect_encoding($str). – cjimti Jan 11 '11 at 23:31
  • Gumbo yeah, I may have to fall back on that. It would be nice to be able to tweak `json_encode()` somehow but I don't see any settings to do that @cjimti interesting idea. – Pekka Jan 11 '11 at 23:32
  • Wait – are we talking about `json_encode` or `json_decode`? – Gumbo Jan 11 '11 at 23:34
  • @Gumbo `en` code in this case – Pekka Jan 11 '11 at 23:35
  • @Pekka: Then I’m afraid that you have to write your own JSON generator that can deal with invalid UTF sequences. – Gumbo Jan 11 '11 at 23:46
  • @Gumbo yeah, I'm beginning to fear the same. Yuck! – Pekka Jan 11 '11 at 23:51
  • 1
    There is a json_encode() implementation in http://upgradephp.berlios.de/ - it doesn't care much about the charset in the first place. But I guess the one from ZendF could be adapted as easily. – mario Jan 12 '11 at 00:02
  • You are lucky. My `json_encode` returns `false` if there is a wrong character in any place of encoded array. – Finesse May 29 '17 at 05:24

8 Answers

52

PHP does try to raise an error, but only if you turn display_errors off. This is odd, because the display_errors setting is only meant to control whether errors are printed to standard output, not whether an error is triggered. To emphasize: with display_errors on, you may see all kinds of other PHP errors, yet PHP doesn't merely hide this one, it never triggers it. That means it will not show up in any error log, and no custom error handler will be called. The error just never occurs.

Here's some code that demonstrates this:

error_reporting(-1);//report all errors
$invalid_utf8_char = chr(193);

ini_set('display_errors', 1);//display errors to standard output
var_dump(json_encode($invalid_utf8_char));
var_dump(error_get_last());//nothing

ini_set('display_errors', 0);//do not display errors to standard output
var_dump(json_encode($invalid_utf8_char));
var_dump(error_get_last());// json_encode(): Invalid UTF-8 sequence in argument

That bizarre and unfortunate behavior is related to this bug https://bugs.php.net/bug.php?id=47494 and a few others, and doesn't look like it will ever be fixed.

Workaround:

Cleaning the string before passing it to json_encode may be a workable solution.

$stripped_of_invalid_utf8_chars_string = iconv('UTF-8', 'UTF-8//IGNORE', $orig_string);
if ($stripped_of_invalid_utf8_chars_string !== $orig_string) {
    // one or more chars were invalid, and so they were stripped out.
    // if you need to know where in the string the first stripped character was, 
    // then see http://stackoverflow.com/questions/7475437/find-first-character-that-is-different-between-two-strings
}
$json = json_encode($stripped_of_invalid_utf8_chars_string);

http://php.net/manual/en/function.iconv.php

The manual says

//IGNORE silently discards characters that are illegal in the target charset.

So by first removing the problematic characters, in theory json_encode() shouldn't get anything it will choke on. I haven't verified that the output of iconv with the //IGNORE flag is perfectly compatible with json_encode()'s notion of what valid UTF-8 is, so buyer beware: there may be edge cases where it still fails. Ugh, I hate character set issues.

Edit
In PHP 7.2+, there are two new flags for json_encode(): JSON_INVALID_UTF8_IGNORE and JSON_INVALID_UTF8_SUBSTITUTE.
There's not much documentation yet, but for now this test should help you understand the expected behavior: https://github.com/php/php-src/blob/master/ext/json/tests/json_encode_invalid_utf8.phpt

And in PHP 7.3+ there's the new flag JSON_THROW_ON_ERROR. See http://php.net/manual/en/class.jsonexception.php
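
As a rough illustration of the three flags mentioned above, here is a minimal sketch. It assumes PHP 7.3 or later, and the variable names are only for illustration. The broken string is written with an explicit `\xFC` byte, so it does not depend on the encoding of the source file.

```php
<?php
// "D\xFCsseldorf" is the Latin-1 byte sequence for "Düsseldorf";
// a lone \xFC byte is not valid UTF-8.
$data = ["D\xFCsseldorf", "Washington", "Nairobi"];

// PHP 7.2+: invalid bytes are replaced with U+FFFD (escaped as \ufffd).
$substituted = json_encode($data, JSON_INVALID_UTF8_SUBSTITUTE);

// PHP 7.2+: invalid bytes are dropped from the output.
$ignored = json_encode($data, JSON_INVALID_UTF8_IGNORE);

// PHP 7.3+: encoding fails loudly with a JsonException.
$errorCode = null;
try {
    json_encode($data, JSON_THROW_ON_ERROR);
} catch (JsonException $e) {
    $errorCode = $e->getCode(); // same codes as json_last_error()
}

echo $substituted, PHP_EOL; // ["D\ufffdsseldorf","Washington","Nairobi"]
echo $ignored, PHP_EOL;     // ["Dsseldorf","Washington","Nairobi"]
var_dump($errorCode === JSON_ERROR_UTF8);
```

For diagnosing encoding problems, the substitute flag comes closest to the "desired result" in the question, while JSON_THROW_ON_ERROR is the one that stops a broken string from going unnoticed.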

goat
  • Interesting and sounds weird! I'll look into this tomorrow. A warning would be enough for me – Pekka Jan 11 '11 at 23:54
  • the iconv() idea looks intriguing and might just work. I'll try that tomorrow as well. – Pekka Jan 12 '11 at 00:04
  • This worked for me. I am `iconv()` ing the data now before json_encoding it. – Pekka Jan 13 '11 at 19:55
  • 1
    @Pekka: I just came across this, and it's probably not relevant now, but `utf8_encode`ing everything in the array works instead of using `iconv`. – Ry- Mar 05 '12 at 22:07
  • @minitech thanks! The core of the issue in this specific case however is how json_encode deals with faulty data. It drops it silently and completely, and that disturbs the process (as there's no way to tell what happened). – Pekka Mar 05 '12 at 22:13
  • 1
    @Pekka - I think utf8_encode is for explicitly changing from iso88591 to utf8. iconv is more generally applicable according to php.net: http://www.php.net/manual/en/function.utf8-encode.php – Ross May 13 '14 at 16:17
  • How to make `json_encode` do wrong characters removing itself? Is there any flag for it? – Finesse May 29 '17 at 05:26
7

This function removes invalid UTF-8 byte sequences from a string:

function removeInvalidChars( $text) {
    $regex = '/( [\x00-\x7F] | [\xC0-\xDF][\x80-\xBF] | [\xE0-\xEF][\x80-\xBF]{2} | [\xF0-\xF7][\x80-\xBF]{3} ) | ./x';
    return preg_replace($regex, '$1', $text);
}

I use it after converting an Excel document to JSON, as Excel documents aren't guaranteed to be in UTF-8.

I don't think there's a particularly sensible way of converting invalid characters to a visible but valid character. You could replace them with U+FFFD, the Unicode replacement character, by turning the regex above around, but that really doesn't provide a better user experience than just dropping invalid characters.
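
For debugging purposes, "turning the regex around" could look roughly like the sketch below. The function name `replaceInvalidChars` is hypothetical, and the pattern is the same loose one as above, so it is not a strict UTF-8 validator (it does not reject overlong forms, for example).

```php
<?php
// Sketch: keep every sequence the pattern accepts, and replace any other
// single byte with U+FFFD (encoded in UTF-8 as EF BF BD).
function replaceInvalidChars(string $text): string
{
    $regex = '/( [\x00-\x7F] | [\xC0-\xDF][\x80-\xBF] | [\xE0-\xEF][\x80-\xBF]{2} | [\xF0-\xF7][\x80-\xBF]{3} ) | ./xs';

    return preg_replace_callback($regex, function (array $m): string {
        // Group 1 is only present when one of the accepted forms matched.
        return isset($m[1]) ? $m[1] : "\xEF\xBF\xBD";
    }, $text);
}

echo json_encode([replaceInvalidChars("D\xFCsseldorf"), "Washington"]), PHP_EOL;
```

The replaced string is valid UTF-8 again, so json_encode() keeps it and the broken spot stays visible as `\ufffd` in the output.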

Danack
6
$s = iconv('UTF-8', 'UTF-8//IGNORE', $s);

This solved the problem. I am not sure why the PHP developers haven't made life easier by fixing json_encode().

Anyway, using the above allows json_encode() to produce output even if the data contains special characters (Swedish letters, for example).

You can then use the result in JavaScript without having to decode the data back to its original encoding (with escape(), unescape(), encodeURIComponent(), decodeURIComponent()).

I am using it like this in php (smarty):

$template = iconv('UTF-8', 'UTF-8//IGNORE', $screen->fetch("my_template.tpl"));

Then I send the result to JavaScript and simply assign the ready template (a piece of HTML) to innerHTML in my document.

Simply put, the line above should somehow be built into json_encode() so that it copes with input that is not clean UTF-8.

Troy Alford
moubi
  • This didn't help my case: `json_encode(iconv('UTF-8', 'UTF-8//IGNORE', 'I’m gonna'))` still shows `\u2019` instead of ’. – Ryan Jan 04 '19 at 01:10
3

Instead of using the iconv function, you can directly use json_encode() with the JSON_UNESCAPED_UNICODE option (PHP >= 5.4.0).

Make sure you send "charset=utf-8" in the Content-Type header of your response:

header('Content-Type: application/json; charset=utf-8');
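
To make clear what this option does and does not change, here is a small sketch (assuming PHP 5.5 or later, where a failed json_encode() returns false). The strings use explicit byte escapes so the result does not depend on the encoding of the source file.

```php
<?php
$valid  = "D\xC3\xBCsseldorf"; // "Düsseldorf" as valid UTF-8
$broken = "D\xFCsseldorf";     // Latin-1 byte, invalid as UTF-8

// Default: non-ASCII characters are escaped as \uXXXX.
$escaped = json_encode($valid);

// With the flag: the UTF-8 bytes are emitted as-is.
$unescaped = json_encode($valid, JSON_UNESCAPED_UNICODE);

// The flag does not make invalid input acceptable.
$failed    = json_encode($broken, JSON_UNESCAPED_UNICODE);
$lastError = json_last_error();

echo $escaped, PHP_EOL;   // "D\u00fcsseldorf"
echo $unescaped, PHP_EOL; // "Düsseldorf"
var_dump($failed, $lastError === JSON_ERROR_UTF8);
```

So JSON_UNESCAPED_UNICODE only affects how valid characters are written; it does not help with diagnosing broken ones.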

CR7
  • I don't see how this would help - all `JSON_UNESCAPED_UNICODE` seems to do is it won't convert Unicode characters into `\uxxxx` entities? It doesn't mean it won't result in an empty string when encountering invalid characters. – Pekka Mar 07 '13 at 09:30
  • This worked great for me! I had seen something similar suggested in another thread but was missing the fact that I needed to add the header, thanks! – Wingman1487 Apr 27 '15 at 00:16
3

You need to know the encoding of all strings you're dealing with, or you're entering a world of pain.

UTF-8 is an easy encoding to use. Also, JSON is defined to use UTF-8 (http://www.json.org/JSONRequest.html). So why not use it?

Short answer: the way to avoid json_encode() dropping your strings is to make sure they are valid UTF-8.
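
One way to check this before encoding is sketched below. The helper name `findInvalidUtf8` is hypothetical and the sketch assumes the mbstring extension is available. It walks a nested array and reports where the broken strings are, with their bytes as hex so they remain visible in a log.

```php
<?php
/**
 * Hypothetical helper: list every string in $data that is not valid UTF-8.
 * Each entry holds the key path (e.g. "2.city") and the raw bytes as hex.
 */
function findInvalidUtf8(array $data, string $prefix = ''): array
{
    $bad = [];
    foreach ($data as $key => $value) {
        $path = $prefix === '' ? (string) $key : $prefix . '.' . $key;
        if (is_array($value)) {
            // Recurse into nested arrays and append their findings.
            $bad = array_merge($bad, findInvalidUtf8($value, $path));
        } elseif (is_string($value) && !mb_check_encoding($value, 'UTF-8')) {
            $bad[] = ['path' => $path, 'hex' => bin2hex($value)];
        }
    }
    return $bad;
}

$report = findInvalidUtf8(["D\xFCsseldorf", "Washington", ['city' => "K\xF6ln"]]);
print_r($report);
```

Logging the report (or throwing when it is non-empty) right before json_encode() makes a wrongly encoded input file show up immediately instead of as a missing value.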

metamatt
  • 1
    Yeah, true and I'm aware of that. As I said, it just becomes incredibly difficult to debug a broken incoming encoding when suddenly, parts of your JSON simply start vanishing (instead of looking broken). This is more to find errors more easily than to circumvent the broken encoding itself – Pekka Jan 11 '11 at 23:25
  • Wrap or replace json_decode() with something that tests the encoding of each string, and complains somewhere you'll actually see it when any string is not valid UTF-8? – metamatt Jan 11 '11 at 23:58
0

To get an informative error notification on JSON failures, we use this helper. It:

  • temporarily installs a custom error handler to catch JSON errors during encoding/decoding
  • throws a RuntimeException on error
<?php

/**
 * usage:
 * $json = HelperJson::encode(['bla'=>'foo']);
 * $array = HelperJson::decode('{"bla":"foo"}');
 * 
 * throws exception on failure
 * 
 */
class HelperJson {

    /**
     * @var array
     */
    static private $jsonErrors = [
            JSON_ERROR_NONE => '',
            JSON_ERROR_UTF8 => 'Malformed UTF-8 characters, possibly incorrectly encoded',
            JSON_ERROR_DEPTH => 'Maximum stack depth exceeded',
            JSON_ERROR_STATE_MISMATCH => 'Underflow or the modes mismatch',
            JSON_ERROR_CTRL_CHAR => 'Unexpected control character found',
            JSON_ERROR_SYNTAX => 'Syntax error, malformed JSON',
    ];

    /**
     * ! assoc ! (reverse logic to php function)
     * @param string $jsonString
     * @param bool $assoc
     * @throws RuntimeException
     * @return array|null
     */
    static public function decode($jsonString, $assoc=true){

        HelperJson_ErrorHandler::reset(); // see below
        set_error_handler('HelperJson_ErrorHandler::handleError');

        $result = json_decode($jsonString, $assoc);

        $errStr = HelperJson_ErrorHandler::getErrstr();
        restore_error_handler();

        $jsonError = json_last_error();
        if( $jsonError!=JSON_ERROR_NONE ) {
            $errorMsg = isset(self::$jsonErrors[$jsonError]) ? self::$jsonErrors[$jsonError] : 'unknown error code: '.$jsonError;
            throw new \RuntimeException('json decoding error: '.$errorMsg.' JSON: '.substr($jsonString,0, 250));
        }
        if( $errStr!='' ){
            throw new \RuntimeException('json decoding problem: '.$errStr.' JSON: '.substr($jsonString,0, 250));
        }
        return $result;
    }

    /**
     * encode with error "throwing"
     * @param mixed $data
     * @param int $options   $options=JSON_PRESERVE_ZERO_FRACTION+JSON_UNESCAPED_SLASHES : 1024 + 64 = 1088
     * @return string
     * @throws \RuntimeException
     */
    static public function encode($data, $options=1088){

        // Seems to be necessary: otherwise UTF-8 problems only raised a warning that did not
        // surface here. Suspicion: the error handler itself does something with JSON and
        // thereby resets json_last_error.
        HelperJson_ErrorHandler::reset();
        set_error_handler('HelperJson_ErrorHandler::handleError');

        $result = json_encode($data, $options);

        $errStr = HelperJson_ErrorHandler::getErrstr();
        restore_error_handler();

        $jsonError = json_last_error();
        if( $jsonError!=JSON_ERROR_NONE ){
            $errorMsg = isset(self::$jsonErrors[$jsonError]) ? self::$jsonErrors[$jsonError] : 'unknown error code: '.$jsonError;
            throw new \RuntimeException('json encoding error: '.$errorMsg);
        }
        if( $errStr!='' ){
            throw new \RuntimeException('json encoding problem: '.$errStr);
        }
        return $result;
    }

}

/**

HelperJson_ErrorHandler::install();
preg_match('~a','');
$errStr = HelperJson_ErrorHandler::getErrstr();
HelperJson_ErrorHandler::remove();

 *
 */
class HelperJson_ErrorHandler {

    static protected  $errno = 0;
    static protected  $errstr = '';
    static protected  $errfile = '';
    static protected  $errline = '';
    static protected  $errcontext = array();

    /**
     * @param int $errno
     * @param string $errstr
     * @param string $errfile
     * @param int $errline
     * @param array $errcontext
     * @return bool
     */
    // Defaults added because PHP 8 no longer passes $errcontext to error handlers.
    static public function handleError($errno, $errstr, $errfile = '', $errline = 0, $errcontext = array()){
        self::$errno = $errno;
        self::$errstr = $errstr;
        self::$errfile = $errfile;
        self::$errline = $errline;
        self::$errcontext = $errcontext;
        return true;
    }

    /**
     * @return int
     */
    static public function getErrno(){
        return self::$errno;
    }
    /**
     * @return string
     */
    static public function getErrstr(){
        return self::$errstr;
    }
    /**
     * @return string
     */
    static public function getErrfile(){
        return self::$errfile;
    }
    /**
     * @return int
     */
    static public function getErrline(){
        return self::$errline;
    }
    /**
     * @return array
     */
    static public function getErrcontext(){
        return self::$errcontext;
    }
    /**
     * reset last error
     */
    static public function reset(){
        self::$errno = 0;
        self::$errstr = '';
        self::$errfile = '';
        self::$errline = 0;
        self::$errcontext = array();
    }

    /**
     * set black-hole error handler
     */
    static public function install(){
        self::reset();
        set_error_handler('HelperJson_ErrorHandler::handleError');
    }

    /**
     * restore previous error handler
     */
    static function remove(){
        restore_error_handler();
    }
}

Grain
0

WordPress has a wrapper around json_encode() that prevents this issue. You can look at the source code of wp_json_encode, but it boils down to:

$data = [ utf8_decode("Düsseldorf"), "Washington", "Nairobi" ];

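foreach ( $data as &$string ) {
  $encoding = mb_detect_encoding( $string, mb_detect_order(), true );
  if ( $encoding ) {
      // Known encoding: convert it to UTF-8
      $string = mb_convert_encoding( $string, 'UTF-8', $encoding );
  } else {
      // Unknown encoding: treat as UTF-8, which substitutes invalid bytes
      $string = mb_convert_encoding( $string, 'UTF-8', 'UTF-8' );
  }
}
unset( $string ); // break the reference left over from the loop

echo json_encode( $data );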

// Result: ["D?sseldorf","Washington","Nairobi"]

For data that is a nested array or an object, or that can contain non-scalars, see _wp_json_sanity_check for more detailed code.

Lucas Bustamante
0

Remove non-printable characters from strings

$result = preg_replace('/[[:^print:]]/', "", $string);

Solution from https://alvinalexander.com/php/how-to-remove-non-printable-characters-in-string-regex/