0

I have a JSON file which I'm downloading from an API. Currently, I've been able to export it to a JSON and via an Excel Power Query parse the data accurately.

The data is divided by Campaign IDS (in this case, only two), then, by each day of the selected period, there are a couple of different metrics associated. For example, these are a few (incomplete) lines for you to see how it is supposed to work.

campaignId  metadata.id  metrics.impressions   metrics.clicks
s00821idk   2019-05-19   12000293121           100
s00821idk   2019-05-18   12300223151           103

I've tried working with Excel to parse this data, which kind of defeats the purpose of using the API. Since I'm exporting from Python, running it through Excel and then placing it in a Google Sheet.

I would like to do all the transformations in Python, so that I can use Google Sheets API and place it in there.

In the following link, I provide the exported JSON file. file

Would love if you could help me structure the data in that way. Thank you so much.

John ED
  • 55
  • 6
  • 1
    You need to [flatten the JSON](https://stackoverflow.com/questions/1871524/how-can-i-convert-json-to-csv). If you want a more complex CSV layout you will need to run through the JSON as a dict with `json.load()` (or as an [OrderedDict](https://stackoverflow.com/questions/6921699/can-i-get-json-to-load-into-an-ordereddict)) and write the nested values as you find suitable – Matt M May 22 '19 at 14:34

2 Answers2

0

As stated, you'll need to completely flatten the multiple nested values, the iterate through to get what you want. It can be done, but it is pretty big (24,000+ columns for each campaign id), so takes 2 minutes to iterate through the whole thing you provided.

import json
import pandas as pd
import re


with open('C:/data.json') as f:
    jsonObj = json.load(f)


def flatten_json(y):
    out = {}
    def flatten(x, name=''):
        if type(x) is dict:
            for a in x:
                flatten(x[a], name + a + '_')
        elif type(x) is list:
            i = 0
            for a in x:
                flatten(a, name + str(i) + '_')
                i += 1
        else:
            out[name[:-1]] = x
    flatten(y)
    return out


flat = flatten_json(jsonObj)

results = pd.DataFrame()
special_cols = []

columns_list = list(flat.keys())
for item in columns_list:
    try:
        row_idx = re.findall(r'\_(\d+)\_', item )[0]
    except:
        special_cols.append(item)
        continue
    column = re.findall(r'\_\d+\_(.*)', item )[0]
    column = column.replace('_', '')

    row_idx = int(row_idx)
    value = flat[item]

    results.loc[row_idx, column] = value


for item in special_cols:
    results[item] = flat[item]

results.to_csv('file.csv', index=False)

Output:

print (results)
                           campaignId  ... totalCampaigns
0  0081da282b2dbe8140508074366cac91ba  ...              2
1  00c03d801da285767a093d0b4d5188fb34  ...              2

[2 rows x 24533 columns]
chitown88
  • 27,527
  • 4
  • 30
  • 59
  • Thank you so much for your answer. I would need to have each date on each row, with each date with its corresponding metrics and ID. I guess it's too complicated & heavy to do anyway, so I might simply export it manually from the platform. If only the data from the API came already in CSV format, it would be much simpler. – John ED May 22 '19 at 16:30
0

IIUC - what about the following approach.
It iterates over all campaignResults, iterates in each one over all results and writes out campaignID, metadata.id and (as an example) metrics.impressions and metrics.clicks per row:

import json
sep = '\t'
with open(jsonfile) as jsonin, open('j2c.csv', 'w') as f:
    j = json.load(jsonin)
    f.write(f'campaignId{sep}metadata.id{sep}metrics.impressions{sep}metrics.clicks\n')
    for cR in range(j['totalCampaigns']):
        for r in range(j['campaignResults'][cR]['totalResults']):
            f.write(j['campaignResults'][cR]['campaignId']+ sep)
            f.write(j['campaignResults'][cR]['results'][r]['metadata']['id']+ sep)
            f.write(str(j['campaignResults'][cR]['results'][r]['metrics']['impressions']) + sep)
            f.write(str(j['campaignResults'][cR]['results'][r]['metrics']['clicks']) + '\n')

result:

# campaignId    metadata.id metrics.impressions metrics.clicks
# 0081da282b2dbe8140508074366cac91ba    2019-05-20  176430.0    59.0
# 0081da282b2dbe8140508074366cac91ba    2019-05-19  169031.0    59.0
# 0081da282b2dbe8140508074366cac91ba    2019-05-18  108777.0    62.0
# 0081da282b2dbe8140508074366cac91ba    2019-05-17  272088.0    60.0
# 0081da282b2dbe8140508074366cac91ba    2019-05-16  198100.0    62.0
# ...
# 00c03d801da285767a093d0b4d5188fb34    2018-01-10  0.0 0.0
# 00c03d801da285767a093d0b4d5188fb34    2018-01-09  0.0 0.0
# 00c03d801da285767a093d0b4d5188fb34    2018-01-08  0.0 0.0
# 00c03d801da285767a093d0b4d5188fb34    2018-01-07  0.0 0.0
# 00c03d801da285767a093d0b4d5188fb34    2018-01-06  0.0 0.0

I still do not really understand exactly what data you want to extract - only values which have a date-like pattern or only specific dates?
Besides that, I do not really get the structure of your json file, so I tried to create a tree print out which may help to get a clearer view and to formulate the question more precisely:

with open(file) as f:
    j = json.load(f)

def getStructure(dct, ind=''):
    indsym = '.\t'
    for k, v in dct.items():
        if type(v) is list:
            print(f'{ind}{k}[{len(v)}]')
            getStructure(v[0], ind + indsym)
        elif type(v) is dict:
            print(f'{ind}{k}')
            getStructure(v, ind + indsym)
        else:
            print(f'{ind}{k}')

getStructure(j)

result:

# campaignResults[2]
# .       campaignId
# .       results[500]
# .       .       metadata
# .       .       .       id
# .       .       .       fromDate
# .       .       .       toDate
# .       .       .       lastCappingTime
# .       .       metrics
# .       .       .       impressions
# .       .       .       clicks
# .       .       .       conversions
# .       .       .       spend
# .       .       .       ecpc
# .       .       .       ctr
# .       .       .       conversionRate
# .       .       .       cpa
# .       .       .       totalValue
# .       .       .       averageValue
# .       .       .       conversionMetrics[6]
# .       .       .       .       name
# .       .       .       .       conversions
# .       .       .       .       conversionRate
# .       .       .       .       cpa
# .       .       .       .       totalValue
# .       .       .       .       averageValue
# .       totalResults
# totalCampaigns

a little problem here: I think that there are not always the same keys in similar list elements:

j['campaignResults'][0]['results'][0]['metadata'].keys()
# dict_keys(['id', 'fromDate', 'toDate', 'lastCappingTime'])

j['campaignResults'][1]['results'][0]['metadata'].keys()
# dict_keys(['id', 'fromDate', 'toDate'])

So note that the above getStructure function only looks into the first element of a list to get the structure of that one.

SpghttCd
  • 10,510
  • 2
  • 20
  • 25