I need to parse a JSON file which, unfortunately for me, does not follow the JSON specification. I have two issues with the data. I've already found a workaround for the second one, so I'll just mention it at the end in case someone can help there as well.

So I need to parse entries like this:

    "Test":{
        "entry":{
            "Type":"Something"
                },
        "entry":{
            "Type":"Something_Else"
                }
           }, ...

The default json parser updates the dictionary and therefore keeps only the last entry. I HAVE to store the other one as well, and I have no idea how to do this. I also HAVE to keep the keys in the various dictionaries in the same order they appear in the file, which is why I am using an OrderedDict. That works fine, so if there is any way to extend it to handle the duplicate entries, I'd be grateful.

My second issue is that this very same JSON file contains entries like this:

         "Test":{
                   {
                       "Type":"Something"
                   }
                }

The json.load() function raises an exception when it reaches that line in the JSON file. The only workaround I found was to remove the inner braces manually.

Thanks in advance

Greg K.
  • Why don't you change the structure according to your data model? You have a file or data with this structure, right? – ManuParra Mar 28 '15 at 19:46
  • Yes, it's a whole JSON file. Do you have something in mind that I can do? – Greg K. Mar 28 '15 at 19:54
  • Strictly speaking, this is not a JSON file, so parsing it with Python's json parser probably won't work. I'd need a bit of the code to try. You could try simplejson: http://simplejson.readthedocs.org/en/latest/ It's more flexible than the standard library json parser. – ManuParra Mar 28 '15 at 20:01
  • What Python data structure do you want this data to be parsed into? Dictionaries can't have duplicate keys, that's just a property of dictionaries and has nothing to do with JSON. And how would you use this data structure? What would you expect a lookup by key to return, since keys wouldn't be unique? – Lukas Graf Mar 28 '15 at 20:19
  • I wanted to be able to modify the duplicate key name and add it back to the dictionary under a different name. I'm not interested in keeping the same name in the dict, I just don't want to lose the data. – Greg K. Mar 28 '15 at 20:32
  • Answered here: http://stackoverflow.com/questions/20829646/how-do-i-parse-json-with-multiple-keys-the-same – frnhr Jul 11 '16 at 12:40

3 Answers

You can use JSONDecoder.object_pairs_hook to customize how JSONDecoder decodes objects. This hook function will be passed a list of (key, value) pairs that you usually do some processing on, and then turn into a dict.

However, since Python dictionaries don't allow for duplicate keys (and you simply can't change that), you can return the pairs unchanged in the hook and get a nested list of (key, value) pairs when you decode your JSON:

from json import JSONDecoder

def parse_object_pairs(pairs):
    return pairs


data = """
{"foo": {"baz": 42}, "foo": 7}
"""

decoder = JSONDecoder(object_pairs_hook=parse_object_pairs)
obj = decoder.decode(data)
print(obj)

Output (Python 3):

[('foo', [('baz', 42)]), ('foo', 7)]

How you use this data structure is up to you. As stated above, Python dictionaries won't allow for duplicate keys, and there's no way around that. How would you even do a lookup based on a key? dct[key] would be ambiguous.

So you can either implement your own logic to handle a lookup the way you expect it to work, or implement some sort of collision avoidance to make keys unique if they're not, and then create a dictionary from your nested list.
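As a rough illustration of the first option, a lookup over the raw pairs could simply return every value stored under a key. This is only a sketch: get_all is a hypothetical helper name, and returning a list of all matches is one possible interpretation of an ambiguous lookup.

```python
import json


def get_all(pairs, key):
    """Return every value stored under `key`, in document order."""
    return [value for k, value in pairs if k == key]


# The hook returns the pairs unchanged, so objects become lists of tuples.
pairs = json.loads('{"foo": {"baz": 42}, "foo": 7}',
                   object_pairs_hook=lambda p: p)
values = get_all(pairs, "foo")
print(values)
```

Here both values of "foo" come back: the nested object (itself a list of pairs) and the number 7.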


Edit: Since you said you would like to modify the duplicate key to make it unique, here's how you'd do that:

from collections import OrderedDict
from json import JSONDecoder


def make_unique(key, dct):
    counter = 0
    unique_key = key

    while unique_key in dct:
        counter += 1
        unique_key = '{}_{}'.format(key, counter)
    return unique_key


def parse_object_pairs(pairs):
    dct = OrderedDict()
    for key, value in pairs:
        if key in dct:
            key = make_unique(key, dct)
        dct[key] = value

    return dct


data = """
{"foo": {"baz": 42, "baz": 77}, "foo": 7, "foo": 23}
"""

decoder = JSONDecoder(object_pairs_hook=parse_object_pairs)
obj = decoder.decode(data)
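print(obj)

Output (Python 3 before 3.12; newer versions print each OrderedDict in a dict-style form, OrderedDict({...})):

OrderedDict([('foo', OrderedDict([('baz', 42), ('baz_1', 77)])), ('foo_1', 7), ('foo_2', 23)])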

The make_unique function is responsible for returning a collision-free key. In this example it just suffixes the key with _n where n is an incremental counter - just adapt it to your needs.

Because the object_pairs_hook receives the pairs in exactly the order they appear in the JSON document, you can also preserve that order by using an OrderedDict, so I included that as well.
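Since the question reads from a file, it may help to know that json.load and json.loads accept object_pairs_hook directly, so an explicit JSONDecoder is optional. A minimal sketch, assuming input shaped like the question's sample: rename_duplicates is a hypothetical compact variant of the hook above, and io.StringIO stands in for a real open file.

```python
import io
import json
from collections import OrderedDict


def rename_duplicates(pairs):
    """Hypothetical compact hook: suffix repeated keys with _1, _2, ..."""
    result = OrderedDict()
    for key, value in pairs:
        unique_key, n = key, 0
        while unique_key in result:
            n += 1
            unique_key = '{}_{}'.format(key, n)
        result[unique_key] = value
    return result


# io.StringIO simulates a file object; a real file handle works the same way.
fake_file = io.StringIO(
    '{"Test": {"entry": {"Type": "Something"},'
    ' "entry": {"Type": "Something_Else"}}}'
)
loaded = json.load(fake_file, object_pairs_hook=rename_duplicates)
entry_keys = list(loaded["Test"].keys())
print(entry_keys)
```

Both entries survive, with the second one stored under the renamed key entry_1.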

Lukas Graf
  • Thanks, that seems to work. It's a bit odd to index, though, and I would like to implement my logic in the function. Is there any way to check that a key appears for the second time using the pairs passed to the hook function? – Greg K. Mar 28 '15 at 20:28
  • @GregKassapidis updated my answer. The new code should be mostly self-explanatory, but let me know if you need clarification on anything. – Lukas Graf Mar 28 '15 at 20:59
  • Thank you very much. Your first code was already sufficient for me, by the way; I worked on it a bit and managed to do practically the same thing. I'll post it below as well. Yours looks geekier, so I like it better :D – Greg K. Mar 28 '15 at 22:05
  • By the way, do you have any idea about the second issue? How can I filter out those double brackets? – Greg K. Mar 28 '15 at 22:06
  • What exactly do you mean by "double brackets"? The `OrderedDict([(` is just the way an `OrderedDict` is represented if it's printed, otherwise it works just like a normal dict. – Lukas Graf Mar 28 '15 at 22:33
  • It has nothing to do with the OrderedDict class. Please check my first post again. The json decoder fails to parse entries that contain another set of brackets inside them. – Greg K. Mar 28 '15 at 22:45
  • Ah, sorry, I missed that part. That is simply invalid JSON, pretty much any decoder will fail on that. I don't see a way to handle that other than to clean up the JSON beforehand (either manually or programmatically). – Lukas Graf Mar 28 '15 at 22:49
  • Yeah, that's what I am talking about. Is there any "smart" way to programmatically clean up these brackets without ruining anything else? – Greg K. Mar 28 '15 at 23:15
  • Well, no. How should that stray `{"Type":"Something"}` be interpreted? Is there a key missing and it should really be `"somekey": {"Type":"Something"}`? Or is there one too many levels of `{ }` and it should really belong to the `"Test":{}` object? Or should it just be dropped? No parser, not even a forgiving one, can decide that for you. Only the creator of the JSON file (and possibly you) know, so you'll have to fix it in a very specific way. – Lukas Graf Mar 28 '15 at 23:27

Thanks a lot @Lukas Graf, I got it working as well by implementing my own version of the hook function:

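import collections

def dict_raise_on_duplicates(ordered_pairs):
    # Despite the name, duplicates are renamed rather than rejected.
    count = 0  # shared across all duplicates within this one JSON object
    d = collections.OrderedDict()
    for k, v in ordered_pairs:
        if k in d:
            # Store the repeated key under a new, suffixed name.
            d[k + '_dupl_' + str(count)] = v
            count += 1
        else:
            d[k] = v
    return d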

The only thing remaining is to automatically get rid of the double brackets, and I am done :D Thanks again
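For the double brackets, one possible pre-processing step is a text-level cleanup before parsing. This is only a sketch under a strong assumption: that the extra pair of braces is simply one level too many and the inner object should be merged into its parent. unwrap_stray_braces is a hypothetical helper; it handles only inner objects without nested braces, and because it works on raw text it could also alter a string value that happens to contain "{ {".

```python
import json
import re

# Matches "{ { ... } }" where the inner object contains no further braces.
# Assumption: the extra pair of braces is simply one level too many.
STRAY_BRACES = re.compile(r'\{\s*\{([^{}]*)\}\s*\}')


def unwrap_stray_braces(text):
    """Replace '{ {...} }' with '{...}' (hypothetical helper)."""
    return STRAY_BRACES.sub(r'{\1}', text)


broken = '{"Test": { {"Type": "Something"} }}'
cleaned = unwrap_stray_braces(broken)
parsed = json.loads(cleaned)
print(parsed)
```

Valid JSON never has an opening brace directly followed by another opening brace outside a string, which is why the pattern should leave well-formed parts of the file alone.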

Greg K.

If you would prefer to convert those duplicated keys into an array, instead of having separate copies, this could do the work:

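def dict_raise_on_duplicates(ordered_pairs):
    """Collect the values of duplicate keys into a list."""
    d = {}
    duplicated = set()  # keys whose value is already a list of collected values
    for k, v in ordered_pairs:
        if k not in d:
            d[k] = v
        elif k in duplicated:
            d[k].append(v)
        else:
            # Second occurrence: wrap both values, even if the first was a JSON array.
            d[k] = [d[k], v]
            duplicated.add(k)
    return d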

And then you just use:

import json

# The result is named data because dict would shadow the built-in type.
data = json.loads(yourString, object_pairs_hook=dict_raise_on_duplicates)
ferdymercury