
I have a problem. Due to a mistake, I have a lot of invalid JSON strings like this:

{
    "d": {
        "results": [
            {
                "__metadata": {
                    "uri": "https://api.datamarket.azure.com/Data.ashx/Bing/Search/Web?Query=u0027non supporting iframesu0027&Market=u0027it-ITu0027&Adult=u0027Offu0027&Options=u0027DisableLocationDetectionu0027&WebSearchOptions=u0027DisableQueryAlterationsu0027&$skip=0&$top=1",
                    "type": "WebResult"
                },
                "ID": "7858fc9f-6bd5-4102-a835-0fa89e9f992a",
                "Title": "something good",
                "Description": "something "WRONG" here!",
                "DisplayUrl": "www.devx.com/Java/Article/27685/1954",
                "Url": "http://www.devx.com/Java/Article/27685/1954"
            }
        ],
        "__next": "https://api.datamarket.azure.com/Data.ashx/Bing/Search/Web?Query=u0027non%20supporting%20iframesu0027&Market=u0027it-ITu0027&Adult=u0027Offu0027&Options=u0027DisableLocationDetectionu0027&WebSearchOptions=u0027DisableQueryAlterationsu0027&$skip=50"
    }
}

As you can see, the Description field contains a bad string (unescaped " inside "), so I can't parse the JSON with PHP's json_decode; in fact it returns NULL. I have 1 million of these broken JSON documents, each much bigger than this one (about 10 times). How can I fix them in PHP?

hakre
wildnove
  • This same topic came up yesterday, and some weeks before, and a few more occasions even. – mario Nov 07 '12 at 14:05
  • Who wrote the JSON in the first place? You? Or is it from an external source? – Martin Bean Nov 07 '12 at 14:07
  • I would use some type of regex to convert the interior quotes to special chars like `&quot;` – Pitchinnate Nov 07 '12 at 14:07
  • In your case you could exploit the fact that strings in json could not be over multiple lines, however as you already have that problem in there I can not say if this would be the case with all your json-like strings. – hakre Nov 07 '12 at 14:09
  • @Martin Bean external source. But now I've to correct all in a massive and fast way using php. – wildnove Nov 07 '12 at 14:11
  • @Pitchinnate something like...? – wildnove Nov 07 '12 at 14:11
  • @hakre unable to understand... – wildnove Nov 07 '12 at 14:12
  • @wildnove Do you have an example URL? – Martin Bean Nov 07 '12 at 14:12
  • @Martin Bean, sorry martin... no external source... I've a big file – wildnove Nov 07 '12 at 14:14
  • Links: [Convert invalid json into valid json](http://stackoverflow.com/q/8815586/367456); [how to fix a malformed JSON in php](http://stackoverflow.com/q/6911182/367456); [Invalid JSON parsing using PHP](http://stackoverflow.com/q/1575198/367456) - As written it depends. See also: http://stackoverflow.com/search?q=%5Bphp%5D+json+invalid – hakre Nov 07 '12 at 14:18
  • possible duplicate of [How to fix badly formatted JSON in PHP?](http://stackoverflow.com/questions/13236819/how-to-fix-badly-formatted-json-in-php) – Peter O. Dec 04 '12 at 16:52

1 Answer


In your case you could exploit the fact that a string in JSON cannot span more than one line. That gives you a handy anchor for a multi-line aware search and replace with a regular expression function like preg_replace_callback in PHP.

 /^\s+"[a-z_]+": "([^"]*".*)",?$/mi

The pattern reads: whitespace at the beginning of the line; the member name as a string, restricted here to letters and underscores; the : separator; and then the broken string up to the end of the line, optionally followed by a comma (,?).
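To see which lines the pattern picks out, here is a minimal sketch. The sample text and the variable names are made up for illustration, and the member-name class is the letters-and-underscore one described above.

```php
<?php
// Pattern as described above: indented member line whose string value
// contains at least one extra double quote before the closing one.
$pattern = '/^\s+"[a-z_]+": "([^"]*".*)",?$/mi';

// Hypothetical sample: two valid member lines and one broken one.
$sample = <<<'TXT'
{
    "Title": "something good",
    "Description": "something "WRONG" here!",
    "Url": "http://www.devx.com/Java/Article/27685/1954"
}
TXT;

preg_match_all($pattern, $sample, $matches);

// $matches[1] holds the captured string bodies of the matching lines.
print_r($matches[1]);
```

Only the Description line should show up in the result; the two valid lines fail to match because no further quote follows their closing quote.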

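On input like the example, this regex matches only the invalid lines. However, if your JSON also contains valid strings with an escaped quote (\") inside, the regex matches those lines as well, and escaping their quotes a second time would break them.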
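One possible way around that limitation, as a sketch rather than a complete solution, is to escape only the quotes that do not already have a backslash in front of them. The helper name escapeBareQuotes is hypothetical, and the scalar type hints assume PHP 7 or later.

```php
<?php
// Hypothetical helper: escape only double quotes that are not already
// preceded by a backslash, using a negative lookbehind.
// Caveat (assumption of this sketch): an escaped backslash directly before
// a bare quote (written \\" in the text) is mistaken for an escaped quote.
function escapeBareQuotes(string $string): string
{
    // Regex after PHP string unescaping: /(?<!\\)"/
    // Replacement after PHP string unescaping: \\"  (a literal \ and a ")
    return preg_replace('/(?<!\\\\)"/', '\\\\"', $string);
}

// Bare quotes get escaped, so the value decodes as a JSON string.
$fixed = escapeBareQuotes('something "WRONG" here!');
var_dump(json_decode('"' . $fixed . '"'));

// Quotes that are already escaped are left alone.
var_dump(escapeBareQuotes('say \\"hi\\" now'));
```

This still guesses at the intent of the broken data, so it does not remove the need to verify the result.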

It is therefore also worth adding a check that each replacement does what is intended.

$like = '... json-like but broken json string as in question ...';

// Fixing #1: member strings containing double-quotes on the same line.

$fix1Pattern   = '/^(\s+"[a-z_]+": ")([^"]*".*)(",?)$/mi';

$fix1Callback  = function ($matches) {
    list($full, $prefix, $string, $postfix) = $matches;
    $fixed = strtr($string, ['"' => '\"']);
    if (!is_string(json_decode("\"$fixed\""))) {
        throw new Exception('Fix #1 did not work as intended');
    }
    return "$prefix$fixed$postfix";
};


// apply fix1 onto the string

$buffer = preg_replace_callback($fix1Pattern, $fix1Callback, $like);


// test if it finally works

print_r(json_decode($buffer));

Keep in mind that this approach is limited. You might need to learn about regular expressions first, which is a world of its own. But the principle is usually the same: you search the string for the patterns that mark the broken parts and then do some string manipulation to fix them.

If the JSON string is broken in more ways than this, it needs even more love and probably cannot be fixed easily with a regular expression alone.
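Because a single regex fix will not cover every kind of damage, it can help to validate each document after the fix and set aside the ones that still fail, instead of stopping at the first problem. The following is only a sketch: the function name tryFix and the two tiny sample documents are made up for illustration, and how the documents are read from the big file is left open.

```php
<?php
// Hypothetical helper: apply the single-line quote fix to one document and
// report whether the result parses. Returns [fixedString, isValid].
function tryFix($like)
{
    $pattern = '/^(\s+"[a-z_]+": ")([^"]*".*)(",?)$/mi';

    $fixed = preg_replace_callback($pattern, function ($m) {
        // $m[1] = prefix, $m[2] = broken string body, $m[3] = closing quote
        return $m[1] . strtr($m[2], ['"' => '\"']) . $m[3];
    }, $like);

    json_decode($fixed);
    return [$fixed, json_last_error() === JSON_ERROR_NONE];
}

// Made-up sample documents, keyed by position.
$docs = [
    "{\n    \"a\": \"x \"y\" z\"\n}", // inner quotes: fixable by the regex
    "{\n    \"a\": \"x\n}",           // unterminated string: not fixable
];

$failed = [];
foreach ($docs as $i => $doc) {
    list($fixed, $ok) = tryFix($doc);
    if (!$ok) {
        $failed[] = $i; // remember which documents need a different fix
    }
}

print_r($failed);
```

The list of failed documents then shows which other kinds of damage are present and need a fix of their own.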

Example output for the code example and the data provided:

stdClass Object
(
    [d] => stdClass Object
        (
            [results] => Array
                (
                    [0] => stdClass Object
                        (
                            [__metadata] => stdClass Object
                                (
                                    [uri] => https://api.datamarket.azure.com/Data.ashx/Bing/Search/Web?Query=u0027non supporting iframesu0027&Market=u0027it-ITu0027&Adult=u0027Offu0027&Options=u0027DisableLocationDetectionu0027&WebSearchOptions=u0027DisableQueryAlterationsu0027&$skip=0&$top=1
                                    [type] => WebResult
                                )

                            [ID] => 7858fc9f-6bd5-4102-a835-0fa89e9f992a
                            [Title] => something good
                            [Description] => something "WRONG" here!
                            [DisplayUrl] => www.devx.com/Java/Article/27685/1954
                            [Url] => http://www.devx.com/Java/Article/27685/1954
                        )

                )

            [__next] => https://api.datamarket.azure.com/Data.ashx/Bing/Search/Web?Query=u0027non%20supporting%20iframesu0027&Market=u0027it-ITu0027&Adult=u0027Offu0027&Options=u0027DisableLocationDetectionu0027&WebSearchOptions=u0027DisableQueryAlterationsu0027&$skip=50
        )

)
hakre