1

I have pandas dataframe in which one column has its entries in nested json form like this as shown below

Date              ID              Request

12/01/2020        123            {"requestChannel":"amazon","userID":"123@abc.com","Context":"
                                 {\"ref_id\":[\"1-123456\"],\"type\":\"Apparels\","formJson":null,"additionalData":null},"additional":
                                 {"ref_id":"1-12345","last_saved":"order"}}

and thousands of rows like this in same format

I want to take out just the "type" in nested second json on the basis of "ID" like this:

Output

Date             ID        Request

12/01/2020       123       Apparels

What i am trying to do is to parse the Request column of dataframe as json iterate over entire row and take the "type" key from second "Context" json and take the key type

for name in range(data.shape[0]):
        output_json=data['Request'][name]
        json_data=json.loads(output_json)
        try:
            parse_json=json.loads(json_data['Request'])
            value_to_insert=''
            for json_key in parse_json:
                value_to_insert+=str(json_key + '--'+ str(parse_json[json_key])+', '+'\n')

But my code is not changing the dataframe series to correct json form and also iterating over the entire rows is taking a lot of time. Any idea on optimized solution would be appreciated

think-maths
  • 917
  • 2
  • 10
  • 28

1 Answers1

0

Try avoiding loops. Even if you are looping use pandas's DataFrame.iterrows to do so. For the cases of extracting or updating DataFrame.apply will be best for most cases.

If Request column is of type string, then you can do

import json
data['Request'] = data['Request'].apply(lambda x: json.loads(x).get("Context",{}).get("type"))

If your Request column can be None, then you should do

import json
data['Request'] = data['Request'].apply(lambda x: json.loads(x).get("Context",{}).get("type") if x else None)

Edit: Looks like you've a string in Context, You'll have to parse multiple times

import json
data['Request'] = data['Request'].apply(lambda x: json.loads(json.loads(x).get("Context","{}")).get("type") if x else None)
Raghava Dhanya
  • 959
  • 15
  • 22
  • 1
    @jezrael Yes, in most cases. It can give you improvements if you are careful. I tend to follow https://stackoverflow.com/questions/24870953/does-pandas-iterrows-have-performance-issues/24871316#24871316. I should probably change the wording a bit. – Raghava Dhanya Jan 24 '20 at 07:58
  • I tried the solution proposed by you however while running i got the ```AttributeError: 'str' object has no attribute 'get'``` then I added exception to that and still it did not run for any row. – think-maths Jan 24 '20 at 09:28
  • Did you get the output as expected – think-maths Jan 24 '20 at 09:32
  • @SachinKumar Looks like you've string in `Context`, I've updated answer to parse twice, try that out – Raghava Dhanya Jan 24 '20 at 09:46
  • I tried your approach of parsing twice but its throwing json decode error ```JSONDecodeError: Expecting value: line 1 column 1 (char 0)``` seems parsing twice have issues..trying to get any breakthrough to parse nested jsons – think-maths Jan 24 '20 at 10:37