0

I have a JSON object being pulled from an API. I pulled the JSON data into a python dictionary. I'm now finding it difficult to extract data from the dictionary because it is nested and has sublists and sub-dictionaries. To better understand the nature of the data pulled, I tried the below:

url = "https://api.xyz.com/v11/api.json?KEY=abc&LOOKUP=bbb"
response = requests.get(url)
data = response.json()
print (type(data))
print(data.keys())
print (type(data['Results']))
print (len(data['Results']))
print (type(data['Results'][0]))
print (data['Results'][0].keys())
print (type(data['Results'][0]['Result']))
print ((data['Results'][0]['Result'].keys()))
print (type((data['Results'][0]['Result']['Paths'])))
print (len((data['Results'][0]['Result']['Paths'])))
print (type((data['Results'][0]['Result']['Paths'][0])))
print ((data['Results'][0]['Result']['Paths'][0].keys()))
print (type(data['Results'][0]['Result']['Paths'][0]['Technologies']))
print (len(data['Results'][0]['Result']['Paths'][0]['Technologies']))
print ((data['Results'][0]['Result']['Paths'][0]['Technologies'][8].keys()))
print (data['Results'][0]['Result']['Paths'][0]['Technologies'][8]['Tag'])

From the above, I got the following output:

<class 'dict'>
dict_keys(['Results', 'Errors'])
<class 'list'>
1
<class 'dict'>
dict_keys(['Lookup', 'LastIndexed', 'FirstIndexed', 'Meta', 'Result'])
<class 'dict'>
dict_keys(['Paths', 'IsDB', 'Spend'])
<class 'list'>
9
<class 'dict'>
dict_keys(['LastIndexed', 'Technologies', 'Domain', 'SubDomain', 'FirstIndexed', 'Url'])
<class 'list'>
77
dict_keys(['FirstDetected', 'Name', 'LastDetected', 'Categories', 'Description', 'IsPremium', 'Tag', 'Link'])
cdn

From other iterations of this, I know that depending on the list item I choose after 'Paths', I can get a varying list length for 'Technologies' ranging from 5 -100. I'm specifically interested in getting a list of all technologies for which the 'Tag' == A. I want to be able to create a table with all the upper level information for all entries that have the 'Tag' == A. Ideally, I want to get this info in a CSV file. I've looked at Pandas dataframe from nested dictionary and Create a dictionary with list comprehension in Python and Construct pandas DataFrame from items in nested dictionary but get confused when it comes to accessing the list (specially after 'Paths').

So far, the code I have is a simple data dump into a CSV which is not useful at all since all of the data goes into one cell and is not at all usable.

Community
  • 1
  • 1
SizzyNini
  • 275
  • 1
  • 4
  • 11
  • Can you show us the first two elements of your JSON file ? – MMF Nov 09 '16 at 14:58
  • You could perform a FlatDict if that helps https://pypi.python.org/pypi/flatdict and iterate through a flattened out dictionary which helps converting to a CSV format – Robert Lee Nov 09 '16 at 15:08

1 Answers1

0

So I figured out a way to do this. Code below.

import json
import requests
import pandas

domaindict = {}
tech_info = {}
tag = 'mobile'

'''For every domain, pulls the api response, stores in dictionary, and gets dataframe
with technology info by calling the get_paths and set_df functions'''

def get_info(d):
    url = "https://api.xyz.com/v11/api.json?KEY=abc&LOOKUP={}".format(d)
    response = requests.get(url)
    data = response.json()
    domaindict[d] = data
    paths = get_paths(d)
    final_info = set_df(paths)
    return final_info

'''Gets the list of paths for a domain'''        
def get_paths(d):
    paths = domaindict[d]['Results'][0]['Result']['Paths']
    return paths

'''Sets up dataframe to get info at the technology level. Loops through the technology list and path lists.'''
def set_df(paths):
    df_m = pandas.DataFrame({'Domain':[],'Url':[],'Subdomain':[],'Technology Name':[],'Technology Description':[],'Technology Tag':[]})
    for path in paths:
        domain_name = path['Domain']
        url = path['Url']
        subdomain = path['SubDomain']
        techs = path['Technologies']
        for tech in techs: 
            tech_name = tech['Name']
            tech_desc = tech['Description']
            tech_tag = tech['Tag']
            df =  pandas.DataFrame({'Domain':[domain_name], 'Url':[url], 'Subdomain':[subdomain], 
                                    'Technology Name':[tech_name],
                                     'Technology Description': [tech_desc],
                                     'Technology Tag': [tech_tag]})
            df_m = df_m.append(df)
        return df_m


'''loops through the csv file with list of domain names, calls the get_info function and saves dataframe with technology info for
every domain in one file'''

read_domains = pandas.read_excel('domain.xlsx', header = None)
df_f = pandas.DataFrame()
for d in read_domains.values: 
    print(d[0])
    df_i = get_info(d[0])
    df_f = df_f.append(df_i)


with pandas.ExcelWriter('mobile.xlsx') as w:
    df_f.to_excel(w,'mobile')
SizzyNini
  • 275
  • 1
  • 4
  • 11