5

I'm getting a data feed which is in JSON format and the only available format. In PHP, I'm using json_decode to decode the JSON, but it was breaking, and I found out that the JSON was generated in some places with double quotes in a person's nick name. I verified this using: http://jsonformatter.curiousconcept.com

I don't have control over the creation of the data, but I have to deal with this broken format when it occurs. This data after it's parsed will be put into a MySQL TABLE.

For example:

"contact1": "David "Dave" Letterman",

json_decode would return a NULL. If I manually saved the file, and changed it to single quotes around the nickname of Dave, then everything worked.

$json_string = file_get_contents($json_download);
$json_array = json_decode($json_string, true);

How do I fix the broken JSON format in json_string before it gets processed by json_decode? What should be done to pre-process the file, backslash the double quotes of the nickname? Or change them to single quotes? Is it even a good idea to store double quotes like this in MySQL?

I don't know when this might occur with each data feed, so I don't want to just check for contact1 if it has inner double quotes to fix them. Is there a way in PHP to take a line such as the above example, and backslash everything after the colon except the outer double quotes? Thanks!

This is the correct code for as provided by tftd:

<?php
// This:
// "contact1": "David "Dave" Letterman",
// Needs to look like this to be decoded by JSON:
// "contact1": "David \"Dave\" Letterman",

$data ='"contact1": "David "Dave" Letterman",';
function replace($match){
    $key = trim($match[1]);
    $val = trim($match[2]);

    if($val[0] == '"')
        $val = '"'.addslashes(substr($val, 1, -1)).'"';
    else if($val[0] == "'")
        $val = "'".addslashes(substr($val, 1, -1))."'";

    return $key.": ".$val;
}
$preg = preg_replace_callback("#([^{:]*):([^,}]*)#i",'replace',$data);
var_dump($preg);
$json_array = json_decode($preg);
var_dump($json_array);
echo $json_array . "\n";
echo $preg . "\n";
?>

Here is the output:

string(39) ""contact1": "David \"Dave\" Letterman","
NULL

"contact1": "David \"Dave\" Letterman",
Edward
  • 9,430
  • 19
  • 48
  • 71
  • 11
    Can you contact the company that owns the feed and tell them how stupid they are? – SLaks Nov 05 '12 at 17:05
  • 2
    Yes, I can, but then I won't have them as a client anymore. :-) – Edward Nov 05 '12 at 17:06
  • You could word it a bit nicer, but yea. Fixing 'escaping' issues is not something you want to do. It's sometimes even impossible since there can be multiple acceptable interpretations. – Halcyon Nov 05 '12 at 17:07
  • 1
    Explain the GIGO principal to them, then explain that you can fix their data on the fly, but your billables will go up by 50,000% to hire the staff to do it. – Marc B Nov 05 '12 at 17:09
  • If they are you're client it should not be a big deal to inform them that they have an error in their system. Software bugs exist, pointing one out should not be reason for you to lose a client. Present this in terms of this being an issue they should address internally, but if they don't you will need to conduct extra work at extra cost in order to fix the data source on your end. – Mike Brant Nov 05 '12 at 17:09
  • I appreciate what everyone is saying and I agree, but in this situation for now, I need to fix this JSON. – Edward Nov 05 '12 at 17:14
  • Similar question and an answer that does the line-by-line fix: http://stackoverflow.com/a/13272743/367456 - in case it's helpful. – hakre Nov 07 '12 at 15:28

5 Answers5

11

I have a own jsonFixer() function - it works in two steps: removing garbage (for equality of incoherent formatting) and reformatting.

<?php
  function jsonFixer($json){
    $patterns     = [];
    /** garbage removal */
    $patterns[0]  = "/([\s:,\{}\[\]])\s*'([^:,\{}\[\]]*)'\s*([\s:,\{}\[\]])/"; //Find any character except colons, commas, curly and square brackets surrounded or not by spaces preceded and followed by spaces, colons, commas, curly or square brackets...
    $patterns[1]  = '/([^\s:,\{}\[\]]*)\{([^\s:,\{}\[\]]*)/'; //Find any left curly brackets surrounded or not by one or more of any character except spaces, colons, commas, curly and square brackets...
    $patterns[2]  =  "/([^\s:,\{}\[\]]+)}/"; //Find any right curly brackets preceded by one or more of any character except spaces, colons, commas, curly and square brackets...
    $patterns[3]  = "/(}),\s*/"; //JSON.parse() doesn't allow trailing commas
    /** reformatting */
    $patterns[4]  = '/([^\s:,\{}\[\]]+\s*)*[^\s:,\{}\[\]]+/'; //Find or not one or more of any character except spaces, colons, commas, curly and square brackets followed by one or more of any character except spaces, colons, commas, curly and square brackets...
    $patterns[5]  = '/["\']+([^"\':,\{}\[\]]*)["\']+/'; //Find one or more of quotation marks or/and apostrophes surrounding any character except colons, commas, curly and square brackets...
    $patterns[6]  = '/(")([^\s:,\{}\[\]]+)(")(\s+([^\s:,\{}\[\]]+))/'; //Find or not one or more of any character except spaces, colons, commas, curly and square brackets surrounded by quotation marks followed by one or more spaces and  one or more of any character except spaces, colons, commas, curly and square brackets...
    $patterns[7]  = "/(')([^\s:,\{}\[\]]+)(')(\s+([^\s:,\{}\[\]]+))/"; //Find or not one or more of any character except spaces, colons, commas, curly and square brackets surrounded by apostrophes followed by one or more spaces and  one or more of any character except spaces, colons, commas, curly and square brackets...
    $patterns[8]  = '/(})(")/'; //Find any right curly brackets followed by quotation marks...
    $patterns[9]  = '/,\s+(})/'; //Find any comma followed by one or more spaces and a right curly bracket...
    $patterns[10] = '/\s+/'; //Find one or more spaces...
    $patterns[11] = '/^\s+/'; //Find one or more spaces at start of string...

    $replacements     = [];
    /** garbage removal */
    $replacements[0]  = '$1 "$2" $3'; //...and put quotation marks surrounded by spaces between them;
    $replacements[1]  = '$1 { $2'; //...and put spaces between them;
    $replacements[2]  = '$1 }'; //...and put a space between them;
    $replacements[3]  = '$1'; //...so, remove trailing commas of any right curly brackets;
    /** reformatting */
    $replacements[4]  = '"$0"'; //...and put quotation marks surrounding them;
    $replacements[5]  = '"$1"'; //...and replace by single quotation marks;
    $replacements[6]  = '\\$1$2\\$3$4'; //...and add back slashes to its quotation marks;
    $replacements[7]  = '\\$1$2\\$3$4'; //...and add back slashes to its apostrophes;
    $replacements[8]  = '$1, $2'; //...and put a comma followed by a space character between them;
    $replacements[9]  = ' $1'; //...and replace by a space followed by a right curly bracket;
    $replacements[10] = ' '; //...and replace by one space;
    $replacements[11] = ''; //...and remove it.

    $result = preg_replace($patterns, $replacements, $json);

    return $result;
  }
?>

Example of usage:

<?php
  // Received badly formatted json:
  // {"contact1": "David "Dave" Letterman", price : 30.00, 'details' : "Greatest 'Hits' Album"}
  $json_string = '{"contact1": "David "Dave" Letterman", price : 30.00, \'details\' : "Greatest \'Hits\' Album"}';
  jsonFixer($json_string);
?>

Will result:

{"contact1": "David \"Dave\" Letterman", "price" : "30.00", "details" : "Greatest \'Hits\' Album"}

Note: this wasn't tested with all possible badly formatted JSON strings but I use on a complex multi level JSON string and is working well until then.

  • 1
    [\s||\t||.]*(,)[\r||\n||\s||\t]*[\}||\]] for trailing commas. – Barry Jul 08 '16 at 17:03
  • 1
    Thanks @Barry! But `\s` [matches any white space character](http://www.regular-expressions.info/shorthand.html): `[ \t\r\n\f]` and not only the `[ ]` (space) literally. ;D – Fábio - GreenAfter.com Sep 22 '16 at 01:00
  • @MERTDOĞAN I'm glad to help! Whenever I face some json that beat it, I come back here and update. ;D – Fábio - GreenAfter.com Oct 07 '17 at 18:58
  • 1
    Oppss!! Can't fix {"product_names":"Orbis Voltage 26" 21 Vites Kadin Beyaz Dag Bisikleti"} – MERT DOĞAN Oct 07 '17 at 21:06
  • @MERTDOĞAN Yeah, this needs a string value update of key 6 from the $patterns array at reformatting stage. The issue in there is "Find or not [...] surrounded by [...]" must be changed by "[...] surrounded [or not] by [...]" but I'll don't have time this month to test with your sent string, sorry. :/ – Fábio - GreenAfter.com Oct 11 '17 at 06:04
  • 1
    fantastic code but it doesn't work if there's a comma in a string. Example: {"test": "Washington, DC"} - the result of the function is: { "test":"Washington", "DC" } – Roman Feb 02 '19 at 17:08
  • Thanks @Roman! Another fix needed. But I'm still unsure when I'll be able to test, fix, update and edit here. ':/ I hope it's soon...! – Fábio - GreenAfter.com Oct 09 '20 at 06:20
  • 1
    fails on ```{key:'ds:14'}``` – hanshenrik May 20 '23 at 22:20
4

As others have already pointed out, it's best if you tell your client for the problem with the JSON formatting. Ask them to send a bugreport to the original developer/company so they could fix it. If he/they can't fix it - then offer your solution. You simply need to addslashes the string before you json_encode it.

If for some reason you end up having to fix the formatting, here is a way that might work for you:

$data = '"contact1": "David "Dave" Letterman", "contact2": "Peter "Robert" Smith",{\'test\': \'working "something"\'}';
function replace($match){
    $key = trim($match[1]);
    $val = trim($match[2]);

    if($val[0] == '"')
        $val = '"'.addslashes(substr($val, 1, -1)).'"';
    else if($val[0] == "'")
        $val = "'".addslashes(substr($val, 1, -1))."'";

    return $key.": ".$val;
}
$preg = preg_replace_callback("#([^{:]*):([^,}]*)#i",'replace',$data);
var_dump($preg);
// string '"contact1": "David \"Dave\" Letterman", "contact2": "Peter \"Robert\" Smith",{'test': 'working \"something\"'}' (length=110)

Keep in mind this may break if somebody messes with the json format again.

tftd
  • 16,203
  • 11
  • 62
  • 106
  • Thanks for the code. I get a different answer running this: string(74) "{"contact1": David \"Dave\" Letterman, "contact2": Peter \"Robert\" Smith}" Your example has single quotes surrounding it while when I run your code on my system which is CentOS with PHP 5.3.3 is has double quotes. – Edward Nov 05 '12 at 19:31
  • It looks literally like this: "contact1": "David "Dave" Letterman", what I need should be this: "contact1": "David \"Dave\" Letterman", but what I get with your code example is this: "contact1": 'David \"Dave\" Letterman', The double-quotes around the second field are being replaced with single quotes. – Edward Nov 05 '12 at 21:28
  • Try it now. I've replaced the single quotes with double ones. :) – tftd Nov 05 '12 at 21:35
  • Please look above, I edited mine original posting and appended it with the PHP you provided with my data example. Thanks! – Edward Nov 05 '12 at 22:14
  • 1
    I've edited my code again. Hopefully this time it works as you need it to. – tftd Nov 05 '12 at 22:30
0

Tell them to escape their strings before output. You can even offer to fix it or provide the code solution.

Otherwise you can use preg_replace with a regex expression

See Replacing specified double quotes in text with preg_replace

Community
  • 1
  • 1
Anthony Hatzopoulos
  • 10,437
  • 2
  • 40
  • 57
0

As other people have said, you can do a search and replace, but the hard part is going to be creating your fuzzy matching rules, because in order to parse it, you will need to assume some things. Probably, you will need to assume either:

1a) Keys don't contain colons
1b) or key quotes are properly escaped
and
2a) Values don't contain commas
2b) or values have properly escaped quotes.

Even then, you might get into situations where your parsing gets confused, and it gets worse if they have comments their JSON. (Not conforming, but very common.)

Now, depending on the data, you can use newlines to decide when you're looking at a new key, but again, that's not reliable and you start making big assumptions.

So, long story short you either have to make some assumptions that might be made wrong at any time, or you need to get them to fix the data.

Patrick M
  • 1,066
  • 8
  • 23
0

Regexp are not reliable when comma and [] are in the values is contains json strings, worries and nightmare start. In php json_decode fails without quotes on key, one suggest to use pear Services_JSON which achieves the safiest results if code fixed for class names and the game of invalid json is over:

<?php include("Services_JSON-1.0.3b/JSON.php"); 
//Patched version https://github.com/pear/Services_JSON/edit/trunk/JSON.php

$json = <<< JSONCODEFROMJS
   { 
      sos:presents, 
      james:'bond', 
      "agent":[0,0,7], 
      secret:"{mission:'impossible',permit: \"tokill\"}",
      go:true 
    }
JSONCODEFROMJS;

function json_fix($json) {
   $sjson = new Services_JSON(SERVICES_JSON_IN_ARR|SERVICES_JSON_USE_TO_JSON| SERVICES_JSON_LOOSE_TYPE);
   $json_array=$sjson->decode($json);
   return json_encode($json_array);
}

$json_array = json_decode(json_fix($json),true);

if(json_last_error() == JSON_ERROR_NONE) {

   $json=json_encode($json_array,JSON_PRETTY_PRINT);
   echo "<pre>";
   echo(htmlentities($json));
   echo "</pre>";
} else {
   die(json_last_error_msg());
}
?>