
Given the following JSON file loaded into pandas (df = pd.read_json(file)):

[
    {
        "Name": "foo",
        "Details": {
            "Vendor": "Microsoft",
            "Item": "aaa"
        }
    },
    {
        "Name": "bar",
        "Details": {
            "Vendor": "Microsoft",
            "Item": "bbb"
        }
    },
    {
        "Name": "baz",
        "Details": {
            "Vendor": "Microsoft",
            "Item": "ccc"
        }
    },
    {
        "Name": "baz2",
        "Details": {
            "Vendor": "Microsoft",
            "Item": "ccc"
        }
    },
    {
        "Name": "qux",
        "Details": {
            "Vendor": "IBM",
            "Item": "aaa"
        }
    }
]

I want to perform unique counts of items in the JSON file. I want to know the number of unique vendors, and the number of unique vendor-item combinations. With the above JSON, there are 2 unique vendors (Microsoft and IBM) and there are 4 unique vendor-item combinations (baz and baz2 are duplicates).

I believe my current attempts have failed because I have nested JSON objects (dicts) stored inside my DataFrame.

df = pd.read_json(file)
print(df)

Outputs:

   Name                                 Details
0   foo  {'Vendor': 'Microsoft', 'Item': 'aaa'}
1   bar  {'Vendor': 'Microsoft', 'Item': 'bbb'}
2   baz  {'Vendor': 'Microsoft', 'Item': 'ccc'}
3  baz2  {'Vendor': 'Microsoft', 'Item': 'ccc'}
4   qux        {'Vendor': 'IBM', 'Item': 'aaa'}

I've also attempted print(df.groupby("Details").Vendor.nunique()), which results in the error:

AttributeError: 'DataFrameGroupBy' object has no attribute 'Vendor'

roundtheworld

3 Answers


A related question has been asked before. What you are looking for is json_normalize; only a small change to your code is needed, shown below.

# On pandas >= 1.0 this is available directly as pandas.json_normalize
from pandas.io.json import json_normalize
a = [
    {
        "Name": "foo",
        "Details": {
            "Vendor": "Microsoft",
            "Item": "aaa"
        }
    },
    {
        "Name": "bar",
        "Details": {
            "Vendor": "Microsoft",
            "Item": "bbb"
        }
    },
    {
        "Name": "baz",
        "Details": {
            "Vendor": "Microsoft",
            "Item": "ccc"
        }
    },
    {
        "Name": "baz2",
        "Details": {
            "Vendor": "Microsoft",
            "Item": "ccc"
        }
    },
    {
        "Name": "qux",
        "Details": {
            "Vendor": "IBM",
            "Item": "aaa"
        }
    }
]
Loaded = json_normalize(a)
print(Loaded)

This gives me roughly the following flattened output (column names use json_normalize's default "." separator). Now the DataFrame is ready for normal aggregation queries over it:

   Name  Details.Vendor  Details.Item
0   foo       Microsoft           aaa
1   bar       Microsoft           bbb
2   baz       Microsoft           ccc
3  baz2       Microsoft           ccc
4   qux             IBM           aaa
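Once the frame is flat, the two counts from the question become one-liners. A minimal sketch (column names assume json_normalize's default "." separator; pd.json_normalize is the modern spelling of the same function):

```python
import pandas as pd

# Same records as in the question, flattened into ordinary columns
a = [
    {"Name": "foo", "Details": {"Vendor": "Microsoft", "Item": "aaa"}},
    {"Name": "bar", "Details": {"Vendor": "Microsoft", "Item": "bbb"}},
    {"Name": "baz", "Details": {"Vendor": "Microsoft", "Item": "ccc"}},
    {"Name": "baz2", "Details": {"Vendor": "Microsoft", "Item": "ccc"}},
    {"Name": "qux", "Details": {"Vendor": "IBM", "Item": "aaa"}},
]
df = pd.json_normalize(a)

# Number of distinct vendors
unique_vendors = df["Details.Vendor"].nunique()
# Number of distinct vendor-item combinations
unique_pairs = len(df[["Details.Vendor", "Details.Item"]].drop_duplicates())

print(unique_vendors)  # 2
print(unique_pairs)    # 4
```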

Amit
  • This example works when you save the JSON in the script like this. However, when reading from a file using `df = pd.read_json(file)` I get an error executing `json_normalize` that says: "AttributeError: 'str' object has no attribute 'values'" – roundtheworld Jul 26 '19 at 17:08
  • So I'm pretty sure you can get around this by first `import json` then `json_normalize(json.loads(filename))` – Ben Pap Jul 26 '19 at 19:46
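As an aside, json.loads expects the JSON text itself rather than a filename, so the pattern in the comment above needs the file opened first. A sketch of the file-based variant (the filename is hypothetical, and the sample data is written out first only to keep the example self-contained):

```python
import json
import pandas as pd

# Write sample data to disk so the example runs as-is;
# in practice input.json already exists.
records = [{"Name": "qux", "Details": {"Vendor": "IBM", "Item": "aaa"}}]
with open("input.json", "w") as f:
    json.dump(records, f)

# json.load reads from an open file object;
# json.loads would try to parse the filename string itself.
with open("input.json") as f:
    data = json.load(f)

df = pd.json_normalize(data)
print(sorted(df.columns))  # ['Details.Item', 'Details.Vendor', 'Name']
```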

You may just prepare/filter your JSON data before feeding it into pandas:

import pandas as pd
import json

data = json.load(open('input.json'))
df = pd.DataFrame([d['Details'] for d in data])
uniq_vendors = df['Vendor'].unique()
uniq_vendor_items = df.drop_duplicates(['Vendor', 'Item'])

print(uniq_vendors)   # output below
['Microsoft' 'IBM']

print(uniq_vendor_items.to_records(index=False))   # output below
[('aaa', 'Microsoft') ('bbb', 'Microsoft') ('ccc', 'Microsoft')
 ('aaa', 'IBM')]
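If only the counts themselves are needed, two more lines on top of the same frame give them directly (data inlined here instead of loaded from input.json, to keep the sketch self-contained):

```python
import pandas as pd

# Same sample data the answer loads with json.load(open('input.json'))
data = [
    {"Name": "foo", "Details": {"Vendor": "Microsoft", "Item": "aaa"}},
    {"Name": "bar", "Details": {"Vendor": "Microsoft", "Item": "bbb"}},
    {"Name": "baz", "Details": {"Vendor": "Microsoft", "Item": "ccc"}},
    {"Name": "baz2", "Details": {"Vendor": "Microsoft", "Item": "ccc"}},
    {"Name": "qux", "Details": {"Vendor": "IBM", "Item": "aaa"}},
]
df = pd.DataFrame([d["Details"] for d in data])

n_vendors = df["Vendor"].nunique()                     # 2 unique vendors
n_pairs = len(df.drop_duplicates(["Vendor", "Item"]))  # 4 unique vendor-item pairs

print(n_vendors)
print(n_pairs)
```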
RomanPerekhrest

Here is another approach you may try:

import pandas as pd
a = pd.read_json("/PathToYourJSON/TempDelete.json")
a['Vendor'] = a['Details'].apply(lambda x: x['Vendor'])
a['Item'] = a['Details'].apply(lambda x: x['Item'])
a.drop(columns="Details", inplace=True)
print(a)

This gives the output as below. Please let me know if this is what you wanted.

   Name     Vendor  Item
0   foo  Microsoft   aaa
1   bar  Microsoft   bbb
2   baz  Microsoft   ccc
3  baz2  Microsoft   ccc
4   qux        IBM   aaa
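From here the counts the question asks for follow directly; the groupby variant mentioned in the comments below gives the same number, since len() of a groupby is its number of groups. A sketch with the data inlined rather than read from disk:

```python
import pandas as pd

# Same frame the answer builds from the JSON file
a = pd.DataFrame({
    "Name": ["foo", "bar", "baz", "baz2", "qux"],
    "Details": [
        {"Vendor": "Microsoft", "Item": "aaa"},
        {"Vendor": "Microsoft", "Item": "bbb"},
        {"Vendor": "Microsoft", "Item": "ccc"},
        {"Vendor": "Microsoft", "Item": "ccc"},
        {"Vendor": "IBM", "Item": "aaa"},
    ],
})
a["Vendor"] = a["Details"].apply(lambda x: x["Vendor"])
a["Item"] = a["Details"].apply(lambda x: x["Item"])
a.drop(columns="Details", inplace=True)

print(a["Vendor"].nunique())               # 2 unique vendors
print(len(a.groupby(["Vendor", "Item"])))  # 4 unique vendor-item groups
```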

Amit
  • I've also gotten it working using `groupby`...is there a preference or a downside? i.e. `unique_vendor_items = df.groupby(["Vendor", "Item"]).count()` then I can do `print(len(unique_vendor_items))` and I get the same result. – roundtheworld Jul 26 '19 at 17:53
  • Will you please paste the print output here (only if you can)? – Amit Jul 26 '19 at 17:58
  • @roundtheworld The code that I have written above will only give you the df. The aggregation statistics remains to be computed. You can do it using groupby, as you have already suggested. – Amit Jul 26 '19 at 18:04
  • `data = json.load(open(r"test.json", "r", encoding="utf8")) df = pd.DataFrame([d["Details"] for d in data]) unique_vendor_products = df.drop_duplicates(["Vendor", "Item"]) unique_vendor_products2 = df.groupby(["Vendor", "Item"]).count() print(unique_vendor_products) print(f"Count: {len(unique_vendor_products)}") print("---------------------") print(unique_vendor_products2) print(f"Count: {len(unique_vendor_products2)}")` Note, I tried this on a much larger data set than I provided and got the same count for both methods. – roundtheworld Jul 26 '19 at 18:07