
Given the following JSON file loaded into pandas (df = pd.read_json(file)):

[
    {
        "Name": "foo",
        "Details": {
            "Vendor": "Microsoft",
            "Item": "aaa"
        }
    },
    {
        "Name": "bar",
        "Details": {
            "Vendor": "Microsoft",
            "Item": "bbb"
        }
    },
    {
        "Name": "baz",
        "Details": {
            "Vendor": "Microsoft",
            "Item": "ccc"
        }
    },
    {
        "Name": "baz2",
        "Details": {
            "Vendor": "Microsoft",
            "Item": "ccc"
        }
    },
    {
        "Name": "qux",
        "Details": {
            "Vendor": "IBM",
            "Item": "aaa"
        }
    }
]

I want to perform unique counts of items in the JSON file. I want to know the number of unique vendors, and the number of unique vendor-item combinations. With the above JSON, there are 2 unique vendors (Microsoft and IBM) and there are 4 unique vendor-item combinations (baz and baz2 are duplicates).

I believe my current attempts have failed because I have nested JSON objects (dicts) stored inside my DataFrame.

df = pd.read_json(file)
print(df)

Outputs:

   Name                                 Details
0   foo  {'Vendor': 'Microsoft', 'Item': 'aaa'}
1   bar  {'Vendor': 'Microsoft', 'Item': 'bbb'}
2   baz  {'Vendor': 'Microsoft', 'Item': 'ccc'}
3  baz2  {'Vendor': 'Microsoft', 'Item': 'ccc'}
4   qux        {'Vendor': 'IBM', 'Item': 'aaa'}

I've also attempted print(df.groupby("Details").Vendor.nunique()), which results in the error:

AttributeError: 'DataFrameGroupBy' object has no attribute 'Vendor'

roundtheworld

3 Answers


A related question has been asked before. What you are looking for is json_normalize; only a small change to your code is needed, shown below.

# On pandas >= 1.0 this is available directly as pandas.json_normalize
from pandas.io.json import json_normalize
a = [
    {
        "Name": "foo",
        "Details": {
            "Vendor": "Microsoft",
            "Item": "aaa"
        }
    },
    {
        "Name": "bar",
        "Details": {
            "Vendor": "Microsoft",
            "Item": "bbb"
        }
    },
    {
        "Name": "baz",
        "Details": {
            "Vendor": "Microsoft",
            "Item": "ccc"
        }
    },
    {
        "Name": "baz2",
        "Details": {
            "Vendor": "Microsoft",
            "Item": "ccc"
        }
    },
    {
        "Name": "qux",
        "Details": {
            "Vendor": "IBM",
            "Item": "aaa"
        }
    }
]
Loaded = json_normalize(a)
print(Loaded)

This gives me roughly the following flattened output (column names use json_normalize's default "." separator). Now the DataFrame is ready for normal aggregation queries over it:

   Name  Details.Vendor  Details.Item
0   foo       Microsoft           aaa
1   bar       Microsoft           bbb
2   baz       Microsoft           ccc
3  baz2       Microsoft           ccc
4   qux             IBM           aaa
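Once the frame is flat, the two counts from the question become one-liners. A minimal sketch (column names assume json_normalize's default "." separator; pd.json_normalize is the modern spelling of the same function):

```python
import pandas as pd

# Same records as in the question, flattened into ordinary columns
a = [
    {"Name": "foo", "Details": {"Vendor": "Microsoft", "Item": "aaa"}},
    {"Name": "bar", "Details": {"Vendor": "Microsoft", "Item": "bbb"}},
    {"Name": "baz", "Details": {"Vendor": "Microsoft", "Item": "ccc"}},
    {"Name": "baz2", "Details": {"Vendor": "Microsoft", "Item": "ccc"}},
    {"Name": "qux", "Details": {"Vendor": "IBM", "Item": "aaa"}},
]
df = pd.json_normalize(a)

# Number of distinct vendors
unique_vendors = df["Details.Vendor"].nunique()
# Number of distinct vendor-item combinations
unique_pairs = len(df[["Details.Vendor", "Details.Item"]].drop_duplicates())

print(unique_vendors)  # 2
print(unique_pairs)    # 4
```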

Amit
  • This example works when you save the JSON in the script like this. However, when reading from a file using `df = pd.read_json(file)` I get an error executing `json_normalize` that says: "AttributeError: 'str' object has no attribute 'values'" – roundtheworld Jul 26 '19 at 17:08
  • So I'm pretty sure you can get around this by first `import json` then `json_normalize(json.loads(filename))` – Ben Pap Jul 26 '19 at 19:46
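As an aside, json.loads expects the JSON text itself rather than a filename, so the pattern in the comment above needs the file opened first. A sketch of the file-based variant (the filename is hypothetical, and the sample data is written out first only to keep the example self-contained):

```python
import json
import pandas as pd

# Write sample data to disk so the example runs as-is;
# in practice input.json already exists.
records = [{"Name": "qux", "Details": {"Vendor": "IBM", "Item": "aaa"}}]
with open("input.json", "w") as f:
    json.dump(records, f)

# json.load reads from an open file object;
# json.loads would try to parse the filename string itself.
with open("input.json") as f:
    data = json.load(f)

df = pd.json_normalize(data)
print(sorted(df.columns))  # ['Details.Item', 'Details.Vendor', 'Name']
```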

You may just prepare/filter your JSON data before feeding it into pandas:

import pandas as pd
import json

data = json.load(open('input.json'))
df = pd.DataFrame([d['Details'] for d in data])
uniq_vendors = df['Vendor'].unique()
uniq_vendor_items = df.drop_duplicates(['Vendor', 'Item'])

print(uniq_vendors)   # output below
['Microsoft' 'IBM']

print(uniq_vendor_items.to_records(index=False))   # output below
[('aaa', 'Microsoft') ('bbb', 'Microsoft') ('ccc', 'Microsoft')
 ('aaa', 'IBM')]
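If only the counts themselves are needed, two more lines on top of the same frame give them directly (data inlined here instead of loaded from input.json, to keep the sketch self-contained):

```python
import pandas as pd

# Same sample data the answer loads with json.load(open('input.json'))
data = [
    {"Name": "foo", "Details": {"Vendor": "Microsoft", "Item": "aaa"}},
    {"Name": "bar", "Details": {"Vendor": "Microsoft", "Item": "bbb"}},
    {"Name": "baz", "Details": {"Vendor": "Microsoft", "Item": "ccc"}},
    {"Name": "baz2", "Details": {"Vendor": "Microsoft", "Item": "ccc"}},
    {"Name": "qux", "Details": {"Vendor": "IBM", "Item": "aaa"}},
]
df = pd.DataFrame([d["Details"] for d in data])

n_vendors = df["Vendor"].nunique()                     # 2 unique vendors
n_pairs = len(df.drop_duplicates(["Vendor", "Item"]))  # 4 unique vendor-item pairs

print(n_vendors)
print(n_pairs)
```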
RomanPerekhrest

Here is another approach you may try:

import pandas as pd
a = pd.read_json("/PathToYourJSON/TempDelete.json")
a['Vendor'] = a['Details'].apply(lambda x: x['Vendor'])
a['Item'] = a['Details'].apply(lambda x: x['Item'])
a.drop(columns="Details", inplace=True)
print(a)

This gives the output as below. Please let me know if this is what you wanted.

   Name     Vendor  Item
0   foo  Microsoft   aaa
1   bar  Microsoft   bbb
2   baz  Microsoft   ccc
3  baz2  Microsoft   ccc
4   qux        IBM   aaa
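From here the counts the question asks for follow directly; the groupby variant mentioned in the comments below gives the same number, since len() of a groupby is its number of groups. A sketch with the data inlined rather than read from disk:

```python
import pandas as pd

# Same frame the answer builds from the JSON file
a = pd.DataFrame({
    "Name": ["foo", "bar", "baz", "baz2", "qux"],
    "Details": [
        {"Vendor": "Microsoft", "Item": "aaa"},
        {"Vendor": "Microsoft", "Item": "bbb"},
        {"Vendor": "Microsoft", "Item": "ccc"},
        {"Vendor": "Microsoft", "Item": "ccc"},
        {"Vendor": "IBM", "Item": "aaa"},
    ],
})
a["Vendor"] = a["Details"].apply(lambda x: x["Vendor"])
a["Item"] = a["Details"].apply(lambda x: x["Item"])
a.drop(columns="Details", inplace=True)

print(a["Vendor"].nunique())               # 2 unique vendors
print(len(a.groupby(["Vendor", "Item"])))  # 4 unique vendor-item groups
```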

Amit
  • I've also gotten it working using `groupby`...is there a preference or a downside? i.e. `unique_vendor_items = df.groupby(["Vendor", "Item"]).count()` then I can do `print(len(unique_vendor_items))` and I get the same result. – roundtheworld Jul 26 '19 at 17:53
  • Will you please paste the print output here (only if you can)? – Amit Jul 26 '19 at 17:58
  • @roundtheworld The code that I have written above will only give you the df. The aggregation statistics remains to be computed. You can do it using groupby, as you have already suggested. – Amit Jul 26 '19 at 18:04
  • `data = json.load(open(r"test.json", "r", encoding="utf8")) df = pd.DataFrame([d["Details"] for d in data]) unique_vendor_products = df.drop_duplicates(["Vendor", "Item"]) unique_vendor_products2 = df.groupby(["Vendor", "Item"]).count() print(unique_vendor_products) print(f"Count: {len(unique_vendor_products)}") print("---------------------") print(unique_vendor_products2) print(f"Count: {len(unique_vendor_products2)}")` Note, I tried this on a much larger data set than I provided and got the same count for both methods. – roundtheworld Jul 26 '19 at 18:07