
I have 1500 JSON files like:

[
  {
    "info1": {
      "name": "John",
      "age": "50",
      "country": "USA"
    },
    "info2": {
      "id1": "129",
      "id2": "151",
      "id3": "196"
    },
    "region": [
      {
        "id": "36",
        "name": "Spook",
        "spot": "2"
      },
      {
        "id": "11",
        "name": "Ghoul",
        "spot": "6"
      },
      {
        "id": "95",
        "name": "Devil",
        "spot": "4"
      }
    ]
  },
  {
    "info1": {
      "name": "Mark",
      "age": "33",
      "country": "Brasil"
    },
    "info2": {
      "id1": "612",
      "id2": "221",
      "id3": "850"
    },
    "region": [
      {
        "id": "68",
        "name": "Ghost",
        "spot": "7"
      },
      {
        "id": "75",
        "name": "Spectrum",
        "spot": "2"
      },
      {
        "id": "53",
        "name": "Phantom",
        "spot": "2"
      }
    ]
  }
]

I load the important information from the JSON files into a DataFrame and add a column with the JSON file name. My code:

import glob
import os

import pandas as pd

path_to_json = 'my files_directory'

json_files = glob.glob(os.path.join(path_to_json, "*.json"))

names = []
dates = []

for file_ in json_files:

    df = pd.read_json(file_)
    df = df.drop(columns=['info1', 'info2'])  # these columns are not important to me, so I drop them

    for regions in df['region']:
        for region in regions:
            names.append(region['name'])
            dates.append(file_)

df2 = pd.DataFrame({'name': names, 'date': dates})

My DataFrame looks like this:

      name           date  
0    Spook      20191111.json  
1    Ghoul      20191111.json  
2    Devil      20191111.json  
3    Ghost      20191111.json  
4    Spectrum   20191111.json  
5    Phantom    20191111.json  

This output is satisfying for me, but when I have 1500 JSON files in the folder, loading them into the DataFrame takes ages. This probably results from using the append() function. How can I modify this code to speed up loading these JSON files?

Thank you in advance for your help.

  • You should see better performance if you create separate dataframes, and then combine them using a [list comprehension](https://stackoverflow.com/a/46661368/4739755) at the end instead. – b_c Nov 19 '19 at 17:43
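
For illustration, a minimal sketch of what this comment suggests, assuming the file layout shown above (the helper name regions_from is made up for this example): build one small DataFrame per file, then concatenate them all in a single call instead of appending row by row.

import glob
import os

import pandas as pd

def regions_from(file_):
    # one small DataFrame per file: one row for each region entry
    df = pd.read_json(file_)
    rows = [{'name': region['name'], 'date': file_}
            for regions in df['region'] for region in regions]
    return pd.DataFrame(rows)

json_files = glob.glob(os.path.join('my files_directory', '*.json'))

# a single concat at the end instead of growing a frame inside the loop
df2 = pd.concat([regions_from(f) for f in json_files], ignore_index=True)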

1 Answer


We can do this by using NumPy's flatten and pandas' concat. Please check out the following code for reference.

import glob
import os

import numpy as np
import pandas as pd

path_to_json = 'my files_directory'

json_files = glob.glob(os.path.join(path_to_json, "*.json"))

main_df = pd.DataFrame()  # DataFrame that will carry the data from all json files

for file_ in json_files:

    df = pd.read_json(file_)
    df = df.drop(columns=['info1', 'info2'])  # these columns are not important to me, so I drop them

    # flatten the array of arrays of objects into a flat array of objects
    regions = np.array(df['region'].tolist()).flatten()

    # tolist() converts the NumPy array back to a plain Python list of dicts
    regions_df = pd.DataFrame(regions.tolist())

    regions_df["date"] = file_
    main_df = pd.concat([main_df, regions_df], ignore_index=True)

FYI: If we use PySpark, we can do this with explode in Spark. In pandas, explode is also available as of version 0.25.
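
For reference, a minimal sketch of that explode variant (assumes pandas >= 0.25; this is an alternative to the NumPy flatten step above):

df = pd.read_json(file_)  # file_ as in the loop above

# explode turns each list element of 'region' into its own row,
# yielding a Series of dicts that pd.DataFrame expands into columns
regions_df = pd.DataFrame(df['region'].explode().tolist())
regions_df["date"] = file_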

Hope this helps you improve the performance.

– Kumar KS