I have a pandas dataframe of about 2 million rows and 80 columns.
I would like to write the dataframe out to both a CSV file and a Parquet file.
Assume the dataframe is already present in the variable df.
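For reproduction purposes, a synthetic stand-in of the same width can be built like this (random floats only, scaled down from 2 million rows so it runs quickly):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df: same 80-column width, fewer rows.
# Bump n_rows to 2_000_000 to match the real size.
n_rows, n_cols = 100_000, 80
df = pd.DataFrame(
    np.random.rand(n_rows, n_cols),
    columns=[f'col_{i}' for i in range(n_cols)],
)
```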
Initial approach:
import time

print('Creating csv and parquet files')
st = time.time()
df.to_csv('output_file.csv')
df.to_parquet('output_file.parquet')
print(f'Created csv and parquet files in {time.time() - st} seconds')
Writing the files this way takes too long. Since the two writes are independent operations, I assumed I could take advantage of multiple processes.

Newer approach:
from concurrent.futures import ProcessPoolExecutor

def build_csv(dataframe, output_filename):
    print(f'Building csv: {output_filename}')
    dataframe.to_csv(output_filename)

def build_parquet(dataframe, output_filename):
    print(f'Building parquet: {output_filename}')
    dataframe.to_parquet(output_filename)

with ProcessPoolExecutor(max_workers=3) as executor:
    executor.submit(build_csv, (df, 'output_file.csv'))
    executor.submit(build_parquet, (df, 'output_file.parquet'))
The newer approach finishes without printing any error, but I do not see any files being created, and I am not sure why that happens.
Also, is there a better (faster) approach for writing a pandas dataframe out to multiple files?