I need to parse a file that looks like JSON but isn't. It is missing the : character between keys and values, so I cannot parse it with json_decode. I'm not the owner of this file, so I have to take it as it is. How can I parse this file? Any thoughts? Thank you.

"sound_materials"
{
    "common"
    {
        "value"     "0"
        "start_drag_sound"      "ui.inv_pickup"
        "end_drag_sound"        "ui.inv_drop"
        "equip_sound"       "ui.inv_equip"
    }
    "chest"
    {
        "value"     "1"
        "start_drag_sound"      "ui.inv_pickup_chest"
        "end_drag_sound"        "ui.inv_drop_chest"
    }
    "pennant"
    {
        "value"     "2"
        "start_drag_sound"      "ui.inv_pickup_pennant"
        "end_drag_sound"        "ui.inv_drop_pennant"
    }
    "key"
    {
        "value"     "3"
        "start_drag_sound"      "ui.inv_pickup_key"
        "end_drag_sound"        "ui.inv_drop_key"
    }
    "metal_small"
    {
        "value"     "4"
        "start_drag_sound"      "ui.inv_pickup_metalsmall"
        "end_drag_sound"        "ui.inv_drop_metalsmall"
        "equip_sound"       "ui.inv_equip_metalsmall"
    }
    "metal_armor"
    {
        "value"     "5"
        "start_drag_sound"      "ui.inv_pickup_metalarmour"
        "end_drag_sound"        "ui.inv_drop_metalarmour"
        "equip_sound"       "ui.inv_equip_metalarmour"
    }
    "metal_blade"
    {
        "value"     "6"
        "start_drag_sound"      "ui.inv_pickup_metalblade"
        "end_drag_sound"        "ui.inv_drop_metalblade"
        "equip_sound"       "ui.inv_equip_metalblade"
    }
    "metal_heavy"
    {
        "value"     "7"
        "start_drag_sound"      "ui.inv_pickup_metalheavy"
        "end_drag_sound"        "ui.inv_drop_metalheavy"
        "equip_sound"       "ui.inv_equip_metalheavy"
    }
    "staff_or_blunt"
    {
        "value"     "8"
        "start_drag_sound"      "ui.inv_pickup_staff"
        "end_drag_sound"        "ui.inv_drop_staff"
        "equip_sound"       "ui.inv_equip_staff"
    }
    "robes"
    {
        "value"     "9"
        "start_drag_sound"      "ui.inv_pickup_robes"
        "end_drag_sound"        "ui.inv_drop_robes"
        "equip_sound"       "ui.inv_equip_robes"
    }
    "leather"
    {
        "value"     "10"
        "start_drag_sound"      "ui.inv_pickup_leather"
        "end_drag_sound"        "ui.inv_drop_leather"
        "equip_sound"       "ui.inv_equip_leather"
    }
    "quiver"
    {
        "value"     "11"
        "start_drag_sound"      "ui.inv_pickup_quiver"
        "end_drag_sound"        "ui.inv_drop_quiver"
        "equip_sound"       "ui.inv_equip_quiver"
    }
    "stone"
    {
        "value"     "12"
        "start_drag_sound"      "ui.inv_pickup_stone"
        "end_drag_sound"        "ui.inv_drop_stone"
        "equip_sound"       "ui.inv_equip_stone"
    }
    "wood"
    {
        "value"     "13"
        "start_drag_sound"      "ui.inv_pickup_wood"
        "end_drag_sound"        "ui.inv_drop_wood"
        "equip_sound"       "ui.inv_equip_wood"
    }
    "bone"
    {
        "value"     "14"
        "start_drag_sound"      "ui.inv_pickup_bone"
        "end_drag_sound"        "ui.inv_drop_bone"
        "equip_sound"       "ui.inv_equip_bone"
    }
    "jug"
    {
        "value"     "15"
        "start_drag_sound"      "ui.inv_pickup_jug"
        "end_drag_sound"        "ui.inv_drop_jug"
        "equip_sound"       "ui.inv_equip_jug"
    }
    "gun"
    {
        "value"     "16"
        "start_drag_sound"      "ui.inv_pickup_gun"
        "end_drag_sound"        "ui.inv_drop_gun"
        "equip_sound"       "ui.inv_equip_gun"
    }
    "highvalue"
    {
        "value"     "17"
        "start_drag_sound"      "ui.inv_pickup_highvalue"
        "end_drag_sound"        "ui.inv_drop_highvalue"
        "equip_sound"       "ui.inv_equip_highvalue"
    }
}

EDIT:

So I used the regex that h2o suggested and it works great to format the file. My mistake is that the example above only shows a part of the file with one-line keys.

Other parts of the file have sub-keys, and in that case I would need to add the [ ] delimiters for the sub-keys:

Bobby Shark
  • Well, it's not JSON... is it in an array? Or a string? Or what? – Deepanshu Goyal Oct 24 '13 at 11:41
  • You have to do some heavy string parsing. See http://stackoverflow.com/questions/13236819/how-to-fix-badly-formatted-json-in-php – subZero Oct 24 '13 at 11:41
  • How about a regex? You could do a search and replace for a line with two strings inside quotes, separated by whitespace, and replace the whitespace with a colon. – Matthew Daly Oct 24 '13 at 11:42
  • Ugh, manually writing a parser of your own seems to me the only sane way (or complain hard and loud enough to the supplier of the file, who can then either fix this if it is supposed to be JSON, or provide you with a parser if it isn't). If you're stuck with writing one of your own, look up some how-tos on creating parsers on the interwebs. – Wrikken Oct 24 '13 at 11:45

2 Answers

That's an absolutely awful whatever-that-format-is-because-it-isn't-JSON. If you can guarantee that it always looks exactly like in your OP (one key per line), then you can fix it by doing this:

$json = preg_replace('/^(\s*"[^"]+")/m', '$1:', $json);

Regex autopsy:

  • ^ - the line MUST start here
  • (\s*"[^"]+") - A capturing group (this is what $1 is) matching:
    • \s* - a space/tab/newline repeated 0 or more times
    • " - a literal " character
    • [^"]+ - Any character that isn't " repeated 1 or more times
    • " - a literal " character
  • /m - our modifier (multiline). This means that ^ matches at the start of every line instead of only at the start of the entire string.
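To see what this replacement does and does not fix, here is a minimal sketch that runs it on a small sample. The sample text and the variable names $raw and $withColons are made up for illustration; only the regex comes from the answer.

```php
<?php
// Hypothetical sample in the same shape as the file in the question.
$raw = <<<'TXT'
"common"
{
    "value"     "0"
    "equip_sound"       "ui.inv_equip"
}
TXT;

// Append a colon to the first quoted string on every line.
$withColons = preg_replace('/^(\s*"[^"]+")/m', '$1:', $raw);

echo $withColons, PHP_EOL;

// The keys now carry colons, but there are still no commas between
// pairs and no outer braces, so the text is not yet valid JSON.
var_dump(json_decode($withColons)); // NULL
```

Every key gets its colon, including the "common" key that sits on its own line above the opening brace, yet json_decode still rejects the result.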

Edit:

WARNING: This doesn't add commas between the values!

You might be better off using:

$json = preg_replace('/("[^"]+")(\s*{[^}]+})/', '$1:$2,', $json); //Add comma for brackets
$json = preg_replace('/("[^"]+")(\s*"[^"]+")/', '$1:$2,', $json); //Add comma for values

This would also work on a single line, but it requires that the characters {, } and " never appear anywhere except as structural tokens (not even inside strings).

Edit again:

This seems to do the trick: the result can be passed to json_decode and passes JSONLint, but it's incredibly ugly and obscure:

$json = preg_replace('/(")(\s*{)/m', '$1:$2', $json); //Fix colons after keys with brackets
$json = preg_replace('/(")([ \t]*")/m', '$1:$2', $json); //Fix colons after keys with values
$json = preg_replace('/(}\s*$)(\s*")/m', '$1,$2', $json); //Fix commas on lines with brackets
$json = preg_replace('/("\s*$)(\s*")/m', '$1,$2', $json); //Fix commas on lines with values
$json = preg_replace('/"[0-9]+":\s*{/m', '{', $json); //Fix invalid keys
$json = trim($json);

if ($json[0] == '{' && substr($json, -1) == '}') {
    $json = '[' . $json . ']';
} else {
    $json = '{' . $json . '}';
}

print_r(json_decode($json));

Update:

<?php
    $curl = curl_init();
    curl_setopt_array($curl, array(
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_URL => "file.txt"
    ));
    $json = curl_exec($curl);

    $json = Horrible_JSON::Parse($json);
    print_r($json);

    class Horrible_JSON {
        public static function Parse($json) {
            $jsonLength = strlen($json);
            $realJSON = '';
            $isValue = false;
            for ($i = 0; $i < $jsonLength; $i++) {
                if ($json[$i] != "\n" && $json[$i] != "\r" && $json[$i] != "\t" && $json[$i] != " ") {
                    if ($json[$i] == '"') {
                        $nextQuote = strpos($json, '"', $i + 1);
                        $quoteContent = substr($json, $i + 1, $nextQuote - $i - 1);
                        if (!$isValue && preg_match('/^[0-9]+$/', $quoteContent)) {
                            $quoteContent = 'int_' . $quoteContent;
                        }
                        $realJSON .= '"' . $quoteContent . '"';
                        if (!$isValue) {
                            $realJSON .= ':';
                            $isValue = true;
                        } else {
                            $realJSON .= ',';
                            $isValue = false;
                        }
                        $i = $nextQuote;
                    } else {
                        if ($json[$i] == '{' || $json[$i] == '}') {
                            $isValue = false;
                        }
                        $realJSON .= $json[$i];
                        if ($json[$i] == '}') {
                            $realJSON .= ',';
                        }
                    }
                }
            }
            $realJSON = str_replace(',}', '}', $realJSON);
            $realJSON = substr($realJSON, 0, -1);

            if (substr($realJSON, 0, 1) == '{' && substr($realJSON, -1) == '}') {
                $realJSON = '[' . $realJSON . ']';
            } else {
                $realJSON = '{' . $realJSON . '}';
            }

            return json_decode($realJSON);
        }
    }
?>
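An alternative to rewriting the text into JSON is to build the PHP array directly with a small recursive parser. The following is only a sketch under stated assumptions: every token is a double-quoted string or a brace, strings contain no escaped quotes, and a repeated key overwrites the earlier one. The function names tokenize and parseBlock are hypothetical, and error handling is minimal.

```php
<?php
// Hypothetical helper: split the text into string and brace tokens.
// Assumes no escaped quotes inside strings.
function tokenize(string $text): array
{
    preg_match_all('/"([^"]*)"|([{}])/', $text, $matches, PREG_SET_ORDER);
    $tokens = [];
    foreach ($matches as $m) {
        // Group 2 is only filled when a brace was matched.
        $tokens[] = ($m[2] ?? '') !== '' ? ['brace', $m[2]] : ['string', $m[1]];
    }
    return $tokens;
}

// Hypothetical helper: read key/value pairs until a closing brace
// (or the end of the input) and return them as a nested array.
function parseBlock(array $tokens, int &$pos): array
{
    $result = [];
    $count = count($tokens);
    while ($pos < $count) {
        [$type, $text] = $tokens[$pos];
        if ($type === 'brace' && $text === '}') {
            $pos++; // consume the closing brace
            return $result;
        }
        if ($type !== 'string') {
            throw new UnexpectedValueException("Expected a key at token $pos");
        }
        $key = $text;
        $pos++;
        if ($pos >= $count) {
            throw new UnexpectedValueException("Key \"$key\" has no value");
        }
        [$type, $text] = $tokens[$pos];
        $pos++;
        if ($type === 'string') {
            $result[$key] = $text;                      // plain value
        } elseif ($text === '{') {
            $result[$key] = parseBlock($tokens, $pos);  // nested block
        } else {
            throw new UnexpectedValueException("Unexpected } after key \"$key\"");
        }
    }
    return $result;
}

// Made-up sample in the same shape as the question's file.
$sample = <<<'TXT'
"sound_materials"
{
    "common"
    {
        "value"     "0"
        "equip_sound"       "ui.inv_equip"
    }
    "chest"
    {
        "value"     "1"
    }
}
TXT;

$pos = 0;
$data = parseBlock(tokenize($sample), $pos);
echo $data['sound_materials']['chest']['value'], PHP_EOL; // prints 1
```

Because nesting is handled by recursion, sub-keys need no special treatment, and purely numeric keys simply become integer array keys instead of needing an int_ prefix.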
h2ooooooo
  • Hi. Thanks for your answer. The regex works great, but I think I have to find another way because the file has some sub-keys that use the { delimiters instead of [{ ... I really don't know how you could do that, since a sub-key looks exactly like a primary key. I will edit my question. – Bobby Shark Oct 24 '13 at 12:10
  • @BobbyShark Check out my newest edit. It's ugly but it seems to work. – h2ooooooo Oct 24 '13 at 12:28
  • @BobbyShark Final version is now in my answer. It seems to work with both of your JSON variables. Please never let me see this code ever again. – h2ooooooo Oct 24 '13 at 12:53
  • Wow. This is working! Amazing! :) The file has 130,000 lines and it gets parsed! So nice, thank you! By the way, I can get it parsed only when I copy/paste the file content into the PHP file and put it in a string. If I use readfile, curl or file_get_contents, it cannot be parsed. Any thoughts? – Bobby Shark Oct 24 '13 at 13:54
  • @BobbyShark That's really odd - perhaps your script is running out of memory. Could you post your full string in a pastebin or something similar (or just link the "json" file)? – h2ooooooo Oct 24 '13 at 13:57
  • Oh, and about the issue with curl and readfile: you can test it with the example you used for your answer. Just put it in a .txt file and you'll see it doesn't work. – Bobby Shark Oct 24 '13 at 14:21
  • @BobbyShark Alright - here's a try without using regex and it seems to work. Do however note that arrays do NOT work and get turned into `int_0`, `int_1` etc as keys. You can of course still iterate through as normal. – h2ooooooo Oct 24 '13 at 15:01
  • Dunno what to say... you just saved my life! This is incredibly nice work, even if you think it's horrible code, and it will help me so much! I'm going to try to parse that file with all the data I need, and I think with what you've done I can manage it by myself. Glad to see people like you around this site taking so much time to help noobs like me! Thanks again – Bobby Shark Oct 24 '13 at 15:12
  • @BobbyShark You're more than welcome. Remember to accept the answer if it helped you :) – h2ooooooo Oct 24 '13 at 15:19

Without access to the original file, it's rather difficult to reverse engineer exactly how it is structured.

If this is a one-off then just use a text editor - it's fairly obvious where to insert ':'s to make it look like a JSON file.

If you need to process a lot of these - then get in touch with whoever produces the data and ask for a formal definition of the format or for them to change the format to JSON.

If neither of these is possible, then writing the code to inject a : between two quoted entities is trivial. But you have no guarantee that this is a valid interpretation of the file format.
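As a rough illustration of that injection, the sketch below inserts a colon only where two quoted entities sit on the same line. The sample text, the variable names $raw and $fixed, and the exact regex are assumptions made for this example, not a definition of the format.

```php
<?php
// Made-up sample in the same shape as the file in the question.
$raw = <<<'TXT'
"chest"
{
    "value"     "1"
    "start_drag_sound"      "ui.inv_pickup_chest"
}
TXT;

// Replace the spaces/tabs between two quoted entities on one line with ": ".
// Newlines are excluded so a key on its own line is left untouched.
$fixed = preg_replace('/("[^"\r\n]*")[ \t]+(?=")/', '$1: ', $raw);

echo $fixed, PHP_EOL;
```

The key "chest" above the brace gets no colon and no commas are added, so the output is still not JSON; the colon placement is only one guess at what the format means.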

symcbean