I'm hitting an encoding issue when retrieving a third-party JSON feed: json_decode() fails, and json_last_error() reports "Unexpected control character found" (JSON_ERROR_CTRL_CHAR).

From what I've read, this can be caused by a non-UTF-8 character appearing in the data.

I've run the copied JSON through a linter, and it is valid. Copying and pasting the JSON from the remote feed into a string and decoding that works fine; it only fails when the feed is fetched directly via file_get_contents.

{
    "numberOfResults": 124,
    "queryTime": 0,
    "products": [
        {
            "productId": "9130047$0290f955-ce36-46c9-9771-184f05985c62",
            "status": null,
            "serviceId": null,
            "productName": null,
            "serviceName": null,
            "productDescription": null,
            "serviceDescription": null,
            "productCategoryId": null,
            "nearestLocation": null,
            "boundary": null,
            "distanceToLocation": null,
            "startDate": null,
            "endDate": null,
            "productImage": null,
            "serviceImage": null,
            "tqual": null,
            "trip_advisor": null,
            "freeEntry": null,
            "booster": null,
            "starRating": null,
            "rateFrom": null,
            "rateTo": null,
            "productClassifications": null,
            "internet_service_ssid": null,
            "internet_service_type": null,
            "linked_productid": null,
            "states": null,
            "suburbs": null,
            "addresses": null,
            "cities": null,
            "comms_em": null,
            "comms_mb": null,
            "comms_burl": null,
            "comms_url": null,
            "comms_ph": null,
            "comms_fx": null,
            "comms_wap": null,
            "internet_points": null
        }
    ],
    "facetGroups": []
}

And just a simple decode...

$raw = file_get_contents($url);
$result = json_decode($raw, false);

// json_last_error() shows JSON_ERROR_CTRL_CHAR
crawf
  • possible duplicate of [Problem with json_decode PHP](http://stackoverflow.com/questions/6324645/problem-with-json-decode-php) – Marcin Orlowski Nov 11 '14 at 21:28
  • Reported as a bug in PHP 5.32. What are you using? http://grokbase.com/t/php/php-bugs/1076k3pade/php-bug-bug-52262-new-json-decode-reports-no-error-while-returning-null – Len_D Nov 11 '14 at 21:29
  • Using 5.4.34, also tried using stripslashes and htmlentities... – crawf Nov 11 '14 at 21:48
  • Did you check, visually, what $raw contains? – vcanales Nov 11 '14 at 21:49
  • Run the data through `hd`, chances are that there are invisible chars that still violate the JSON spec. Alternatively, regex-search for anything that is not inside the expected character set and see what you find. – Ulrich Eckhardt Nov 11 '14 at 21:51
  • Have you tried file_get_contents($url,0,null,null) ? – vcanales Nov 11 '14 at 21:55
  • @devJunk - $raw looks fine, no strange characters. The full file_get_contents call doesn't work either. Also, what's hd? – crawf Nov 11 '14 at 22:41
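
`hd` is a command-line hex dump tool, and the same kind of check can be approximated in PHP when the data only misbehaves after file_get_contents. The following is a minimal sketch, not code from the thread: `dump_leading_bytes` is a hypothetical helper, and `$sample` is a made-up stand-in for the feed (a UTF-16LE byte order mark followed by UTF-16LE text), since the real response is not shown here.

```php
<?php
// Hypothetical helper (not from the thread): hex-dump the first bytes of a
// string, similar in spirit to what `hd` shows on the command line.
function dump_leading_bytes($raw, $length = 16) {
    // bin2hex() yields two hex digits per byte; chunk_split() puts a space
    // after each pair, and trim() drops the trailing space.
    return trim(chunk_split(bin2hex(substr($raw, 0, $length)), 2, ' '));
}

// Assumed sample data: UTF-16LE BOM (FF FE) followed by '{"a":1}' in UTF-16LE.
$sample = "\xFF\xFE" . iconv('UTF-8', 'UTF-16LE', '{"a":1}');

echo dump_leading_bytes($sample, 8), "\n"; // prints: ff fe 7b 00 22 00 61 00
```

If the real feed shows a similar pattern, with a leading `ff fe` and a `00` byte after every ASCII character, the payload is UTF-16LE rather than UTF-8. Those bytes are invisible when the response is echoed to a browser or pasted into a linter, which would be consistent with the copied JSON decoding fine.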

1 Answer

Thanks to @UlrichEckhardt's suggestion, I found the link below, which provides a regex for stripping invalid byte sequences, in case anyone else comes across this issue.

// Modified from http://magp.ie/2011/01/06/remove-non-utf8-characters-from-string-with-php/
// Simply strip out incompatible chars
function lint_json($string) {
    // Strip most ASCII control characters, overlong 2-byte sequences, and characters at or above U+10000 (removed outright, not replaced with ?)
    $string = preg_replace('/[\x00-\x08\x10\x0B\x0C\x0E-\x19\x7F]|[\x00-\x7F][\x80-\xBF]+|([\xC0\xC1]|[\xF0-\xFF])[\x80-\xBF]*|[\xC2-\xDF]((?![\x80-\xBF])|[\x80-\xBF]{2,})|[\xE0-\xEF](([\x80-\xBF](?![\x80-\xBF]))|(?![\x80-\xBF]{2})|[\x80-\xBF]{3,})/S', '', $string );

    // Strip overlong 3-byte sequences and UTF-16 surrogates (removed outright, not replaced with ?)
    $string = preg_replace('/\xE0[\x80-\x9F][\x80-\xBF]|\xED[\xA0-\xBF][\x80-\xBF]/S','', $string );

    return $string;
}

EDIT:

After further investigation, it turned out that the supplied JSON is encoded as UTF-16, while json_decode only works with UTF-8 strings. Converting the data to UTF-8 first fixes the problem:

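function lint_json2($string) {
    // Convert the UTF-16LE feed to UTF-8, which json_decode expects.
    // Per the PHP manual, //IGNORE is appended to the target encoding, not the source.
    $string = iconv('UTF-16LE', 'UTF-8//IGNORE', $string);

    // Dirty, but strip anything (such as a byte order mark) before the first opening brace
    $string = strstr($string, '{');

    return $string;
}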
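
The conversion above hard-codes UTF-16LE and relies on strstr to drop whatever precedes the first brace. A slightly more defensive variation is to look at the byte order mark and convert only when one is present. This is a sketch under stated assumptions, not the fix actually used here: `decode_feed` is a hypothetical name, the feed is assumed to carry a BOM, and UTF-32 (whose little-endian BOM also begins with FF FE) is deliberately not handled.

```php
<?php
// Hypothetical helper (not from the thread): convert a JSON payload to UTF-8
// based on its byte order mark, then decode it.
function decode_feed($raw) {
    if (substr($raw, 0, 3) === "\xEF\xBB\xBF") {
        // UTF-8 BOM: the data is already UTF-8, so only drop the marker.
        $raw = substr($raw, 3);
    } elseif (substr($raw, 0, 2) === "\xFF\xFE") {
        // UTF-16 little-endian BOM: drop it and convert the rest.
        $raw = iconv('UTF-16LE', 'UTF-8', substr($raw, 2));
    } elseif (substr($raw, 0, 2) === "\xFE\xFF") {
        // UTF-16 big-endian BOM: drop it and convert the rest.
        $raw = iconv('UTF-16BE', 'UTF-8', substr($raw, 2));
    }
    // No BOM: assume the data is already UTF-8 (an assumption of this sketch).
    return json_decode($raw, false);
}

// Assumed sample data standing in for the remote feed.
$sample = "\xFF\xFE" . iconv('UTF-8', 'UTF-16LE', '{"numberOfResults":124}');
$result = decode_feed($sample);
var_dump($result->numberOfResults);
```

Unlike the strstr approach, this leaves payloads whose top-level value is an array (starting with `[`) intact, at the cost of doing nothing for UTF-16 data that arrives without a BOM.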
crawf
  • One of the points of JSON is that it is Unicode-capable. Filtering anything outside of the basic multilingual plane means that it doesn't work for several cases. Fix the code that generates the broken JSON output instead of trying to work around it. – Ulrich Eckhardt Nov 12 '14 at 20:33
  • That's a fair call - however I don't have control over what's generating the feed. All I can do is suggest to them to fix it. However, further investigation in edit above. – crawf Nov 12 '14 at 22:33