
I have 1500 JSON files like:

[
  {
    "info1": {
      "name": "John",
      "age": "50",
      "country": "USA"
    },
    "info2": {
      "id1": "129",
      "id2": "151",
      "id3": "196"
    },
    "region": [
      {
        "id": "36",
        "name": "Spook",
        "spot": "2"
      },
      {
        "id": "11",
        "name": "Ghoul",
        "spot": "6"
      },
      {
        "id": "95",
        "name": "Devil",
        "spot": "4"
      }
    ]
  },
  {
    "info1": {
      "name": "Mark",
      "age": "33",
      "country": "Brasil"
    },
    "info2": {
      "id1": "612",
      "id2": "221",
      "id3": "850"
    },
    "region": [
      {
        "id": "68",
        "name": "Ghost",
        "spot": "7"
      },
      {
        "id": "75",
        "name": "Spectrum",
        "spot": "2"
      },
      {
        "id": "53",
        "name": "Phantom",
        "spot": "2"
      }
    ]
  }
]

I load the important information from the JSON files into a DataFrame and add a column with the JSON file name. My code:

import glob
import os

import pandas as pd

path_to_json = 'my files_directory'

json_files = glob.glob(os.path.join(path_to_json, "*.json"))

names = []
dates = []

for file_ in json_files:

    df = pd.read_json(file_)
    df = df.drop(columns=['info1', 'info2'])  # these columns are not important to me, so I drop them

    for regions in df['region']:
        for region in regions:
            names.append(region['name'])
            dates.append(file_)

df2 = pd.DataFrame({'name': names, 'date': dates})

My DataFrame looks like this:

      name           date  
0    Spook      20191111.json  
1    Ghoul      20191111.json  
2    Devil      20191111.json  
3    Ghost      20191111.json  
4    Spectrum   20191111.json  
5    Phantom    20191111.json  

This output is satisfying for me, but when I have 1500 JSON files in the folder, loading them into the DataFrame takes ages. This probably results from using the append() function. How can I modify this code to speed up loading these JSON files?

Thank you in advance for your help.

  • You should see better performance if you create separate dataframes, and then combine them using a [list comprehension](https://stackoverflow.com/a/46661368/4739755) at the end instead. – b_c Nov 19 '19 at 17:43
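
For illustration, a minimal sketch of what this comment suggests, assuming the file layout shown above (the helper name regions_from is made up for this example): build one small DataFrame per file, then concatenate them all in a single call instead of appending row by row.

import glob
import os

import pandas as pd

def regions_from(file_):
    # one small DataFrame per file: one row for each region entry
    df = pd.read_json(file_)
    rows = [{'name': region['name'], 'date': file_}
            for regions in df['region'] for region in regions]
    return pd.DataFrame(rows)

json_files = glob.glob(os.path.join('my files_directory', '*.json'))

# a single concat at the end instead of growing a frame inside the loop
df2 = pd.concat([regions_from(f) for f in json_files], ignore_index=True)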

1 Answer


We can do this by using NumPy's flatten and pandas' concat. Please check out the following code for reference.

import glob
import os

import numpy as np
import pandas as pd

path_to_json = 'my files_directory'

json_files = glob.glob(os.path.join(path_to_json, "*.json"))

main_df = pd.DataFrame()  # DataFrame that will carry the data from all json files

for file_ in json_files:

    df = pd.read_json(file_)
    df = df.drop(columns=['info1', 'info2'])  # these columns are not important to me, so I drop them

    # flatten the array of arrays of objects into a flat array of objects
    regions = np.array(df['region'].tolist()).flatten()

    # tolist() converts the NumPy array back to a plain Python list of dicts
    regions_df = pd.DataFrame(regions.tolist())

    regions_df["date"] = file_
    main_df = pd.concat([main_df, regions_df], ignore_index=True)

FYI: If we use PySpark, we can do this with explode in Spark. In pandas, explode is also available as of version 0.25.
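
For reference, a minimal sketch of that explode variant (assumes pandas >= 0.25; this is an alternative to the NumPy flatten step above):

df = pd.read_json(file_)  # file_ as in the loop above

# explode turns each list element of 'region' into its own row,
# yielding a Series of dicts that pd.DataFrame expands into columns
regions_df = pd.DataFrame(df['region'].explode().tolist())
regions_df["date"] = file_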

Hope this helps you improve the performance.

– Kumar KS