I have JSON data files in several directories that I want to import into pandas for some data analysis. The format of the JSON depends on the type encoded in the directory name. For example,
dir1_typeA/
file1
file2
...
dir1_typeB/
file1
file2
...
dir2_typeB/
file1
...
dir2_typeA/
file1
file2
Each file contains a complex nested JSON string that will become one row of a DataFrame. I will have two data frames, one for typeA and one for typeB. Later on I will concatenate them if needed.
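To illustrate what I mean by one file becoming one row: json_normalize flattens nested keys into dotted column names, so each file should turn into a one-row frame. A minimal sketch (the sample JSON here is made up, not my real data):

import json
from pandas import json_normalize

# made-up stand-in for one file's contents
raw = '{"id": 1, "meta": {"source": "sensor", "tags": ["a", "b"]}}'
row = json_normalize(json.loads(raw))
print(list(row.columns))  # ['id', 'meta.source', 'meta.tags']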
So far, I've got all the file paths I need with os.walk and am trying to loop through them:
import os
import json
from glob import glob
from pandas import json_normalize  # pandas >= 1.0; older versions use pandas.io.json.json_normalize

PATH = 'dir/filepath'
files = [y for x in os.walk(PATH) for y in glob(os.path.join(x[0], 'file*'))]

for file in files:
    with open(file, 'r') as f:
        data = f.read()
    data_json = json_normalize(json.loads(data))
    # the type is encoded in the parent directory name, e.g. 'dir1_typeA'
    file_type = os.path.basename(os.path.dirname(file))
    data_json['type'] = file_type
    # append to data frame for typeA and typeB
    if 'typeA' in file_type:
        pass  # append to typeA dataframe
    else:
        pass  # append to typeB dataframe
There is one added issue: files inside a directory may have slightly different fields. For example, file1 may have a few more fields than file2 in dir1_typeA. So, I need to accommodate that dynamic set of fields in the data frame for each type as well.
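To make that concrete, here is a small standalone sketch of the behavior I think I want (the toy rows are made up): if I collect the one-row frames in a list per type and concatenate at the end, pd.concat should take the union of columns and fill the missing fields with NaN.

import pandas as pd

# made-up stand-ins for file1 and file2 from dir1_typeA,
# where file1 has an extra field
row1 = pd.DataFrame([{'a': 1, 'b': 2, 'extra': 3}])
row2 = pd.DataFrame([{'a': 4, 'b': 5}])

# concat takes the union of columns; rows missing a field get NaN
df = pd.concat([row1, row2], ignore_index=True, sort=False)
print(df)
#    a  b  extra
# 0  1  2    3.0
# 1  4  5    NaN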
How do I create these two dataframes?