0

I'm new to python. Tried to build code for my case using this link

I have 200k json files and need to put it in a dataframe.

To achieve that i made sample case for my problem which looks like this.

I have a folder 'test' with json files in sub directories. Directory looks like this:

test>test1>test1> 3 json files 
test>test2>test2> 3 json files 
test>test3>test3> 3 json files

I'm trying to get all the 9 json files into a dataframe. My code is as follows:

import json
import os
import pandas as pd
import json

jpath='C:\\Users\\Sharath\\Desktop\\test'

result=[]
for i in os.listdir(jpath):
    k=os.path.join(jpath,i)
    for j in os.listdir(k):
        l=os.path.join(k,j)
        result.append(l)
print(result)

['C:\\Users\\Sharath\\Desktop\\test\\test1\\test1', 'C:\\Users\\Sharath\\Desktop\\test\\test2\\test2', 'C:\\Users\\Sharath\\Desktop\\test\\test3\\test3']

jsons_data = pd.DataFrame(columns=['homepage_url', 'number_of_employees', 'email_address'])
for i in range(len(result)):
    for j in os.listdir(result[i]):
        with open(os.path.join(result[i],j)) as jfile:
            jtext=json.load(jfile)         
            homepage_url = jtext['homepage_url']
            number_of_employees = jtext['number_of_employees']
            email_address = jtext['email_address']
            jsons_data.loc[index]=[homepage_url,number_of_employees,email_address]
        print(jsons_data)


           homepage_url  number_of_employees          email_address
2  http://www.01tek.com                  1.0  khouidi.you@gmail.com
              homepage_url  number_of_employees      email_address
2  http://www.123listo.com                  NaN  info@123listo.com
                      homepage_url  number_of_employees email_address
2  http://www.immortaloutdoors.com                  NaN              
                         homepage_url  number_of_employees  \
2  http://www.1on1fitnesstraining.com                 50.0   

                      email_address  
2  1on1fitnesstraining013@gmail.com  
                    homepage_url  number_of_employees        email_address
2  http://1onlybat.bigcartel.com                  NaN  office@1onlybat.com
           homepage_url  number_of_employees       email_address
2  http://www.1doc3.com                  5.0  contacto@1doc3.com
           homepage_url  number_of_employees email_address
2  http://1phoneapp.com                 10.0              
  homepage_url  number_of_employees email_address
2         None                  NaN              
                        homepage_url  number_of_employees  \
2  http://www.1stalliancelending.com                 51.0   

              email_address  
2  info@placewelovemost.com  

When i tried to look at the dataframe by calling jsons_data, the result is as follows:

enter image description here

I'm unable to understand why am i getting only one result with index 2. Please help me how to get all the 9 files in dataframe using this method.

Community
  • 1
  • 1
  • I don't think you've assigned or updated the variable `index` anywhere in your code. If so, then it's always equal to zero, and your final list contains a single result, whichever was the last file processed. – K. A. Buhr Dec 18 '16 at 14:20
  • Have you considered pd.read_json() and then concatenating the dataframes? How big are the files on average? This might be an option. If you want to follow the approach in the link you should update index (look again carefully at the link you use, they update the index variable) as mentioned by @K.A.Buhr – fedepad Dec 18 '16 at 15:26

1 Answers1

0

There are few ways you can do this. One is to use pd.read_json() and then concatenating the dataframes. Assuming you want to use the approach in the link you mentioned, you need to update the variable index in the loop, so you could change your code so that your main loop would be:

for i in range(len(result)):
    for index, j in enumerate(os.listdir(result[i])):
        with open(os.path.join(result[i],j)) as jfile:
            jtext=json.load(jfile)         
            homepage_url = jtext['homepage_url']
            number_of_employees = jtext['number_of_employees']
            email_address = jtext['email_address']
            jsons_data.loc[index]=[homepage_url,number_of_employees,email_address]
        print(jsons_data)

Notice how the line:

for j in os.listdir(result[i]):

changed into

for index, j in enumerate(os.listdir(result[i])):
fedepad
  • 4,509
  • 1
  • 13
  • 27