
I'm struggling with converting local JSON files into Parquet files. Each file should be read with pandas and saved as a Parquet file, so that I end up with the same number of files, just as Parquets.

I looped through my directory, collected a list of all my existing JSON files, and put them into a pandas DataFrame.

import os
import pandas as pd

path = 'trackingdata/'

df = list()
for root, dirs, files in os.walk(path, topdown=False):
    for name in files:
        df.append(os.path.join(root, name))
df = pd.DataFrame(df)

Is it better to loop through the DataFrame now and transform each file with

df.to_parquet('trackingdata.parquet')

or would it be better to write the transformation into the code above, right after walking the directory? And how can I transform each of the files to Parquet without joining them all together?

stained

1 Answer


How about defining a json_to_parquet converter:

import os
import pandas as pd

def json_to_parquet(filepath):
    # typ='series' suits a single flat JSON object per file.
    df = pd.read_json(filepath, typ='series').to_frame("name")
    # splitext is safer than splitting on "." when the path
    # contains other dots.
    parquet_file = os.path.splitext(filepath)[0] + ".parquet"
    df.to_parquet(parquet_file)

Depending on how your JSON is formatted, you may need to change the read_json line and/or use the tips here.
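For instance, if each of your files held a JSON array of records instead of a single flat object (a hypothetical layout, shown here with in-memory sample data), the default read_json call would already produce one DataFrame row per record, so the typ='series' part would not be needed. A minimal sketch:

```python
import io
import json
import pandas as pd

# Hypothetical sample data: a JSON array of records.
records = [{"id": 1, "x": 0.5}, {"id": 2, "x": 1.5}]
buf = io.StringIO(json.dumps(records))

# With this layout, each object becomes one DataFrame row.
df = pd.read_json(buf)
print(df.shape)  # (2, 2)
```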

Then just process each file one at a time:

path = 'trackingdata/'

for root, dirs, files in os.walk(path, topdown=False):
    for name in files:
        json_to_parquet(os.path.join(root, name))
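Note that os.walk will hand you every file, not just JSON ones. If your directory might contain other files, a pathlib variant (a sketch; rglob walks the tree recursively) could filter by extension first:

```python
import tempfile
from pathlib import Path

# Demo in a temporary directory with mixed files;
# only *.json should be selected for conversion.
tmp = Path(tempfile.mkdtemp())
(tmp / "a.json").write_text('{"x": 1}')
(tmp / "notes.txt").write_text("not json")

json_files = sorted(tmp.rglob("*.json"))
print([p.name for p in json_files])  # ['a.json']

# Then, with the converter defined above:
# for p in json_files:
#     json_to_parquet(str(p))
```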
dataflow