Is it possible to save a pandas data frame directly to a parquet file? If not, what would be the suggested process?
The aim is to be able to send the parquet file to another team, who can then use Scala code to read/open it. Thanks!
Pandas has a core function to_parquet(). Just write the dataframe to parquet format like this:
df.to_parquet('myfile.parquet')
You still need to install a parquet library such as fastparquet. If you have more than one parquet library installed, you also need to specify which engine you want pandas to use, otherwise it will take the first one that is installed (as described in the documentation). For example:
df.to_parquet('myfile.parquet', engine='fastparquet')
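If you want to sanity-check the file before handing it over, a quick round trip through both engines can help. This is just a minimal sketch, assuming both fastparquet and pyarrow are installed and using a small made-up dataframe:
import pandas as pd
# small example dataframe, made up just for the check
df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
# write with fastparquet, then read the same file back with pyarrow
# to confirm the file is readable outside the library that wrote it
df.to_parquet('myfile.parquet', engine='fastparquet')
roundtrip = pd.read_parquet('myfile.parquet', engine='pyarrow')
print(roundtrip)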
Assuming df is the pandas dataframe, we need to import the following libraries.
import pyarrow as pa
import pyarrow.parquet as pq
First, convert the dataframe df into a pyarrow table.
# Convert DataFrame to Apache Arrow Table
table = pa.Table.from_pandas(df)
Second, write the table into a parquet file, say file_name.parquet.
# Parquet with Snappy compression (the default)
pq.write_table(table, 'file_name.parquet')
# Parquet with GZIP compression
pq.write_table(table, 'file_name.parquet', compression='GZIP')
# Parquet with Brotli compression
pq.write_table(table, 'file_name.parquet', compression='BROTLI')
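To double-check what was written, pyarrow can also read the file straight back into a pandas dataframe; a minimal sketch, reusing the file_name.parquet written above:
import pyarrow.parquet as pq
# read the parquet file back into an Arrow table, then convert to pandas
table2 = pq.read_table('file_name.parquet')
df_roundtrip = table2.to_pandas()
print(df_roundtrip.head())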
Reference: https://tech.blueyonder.com/efficient-dataframe-storage-with-apache-parquet/
There is a relatively early implementation of a package called fastparquet - it could be a good fit for what you need.
https://github.com/dask/fastparquet
conda install -c conda-forge fastparquet
or
pip install fastparquet
from fastparquet import write
write('outfile.parq', df)
or, if you want to use some file options, like row grouping/compression:
write('outfile2.parq', df, row_group_offsets=[0, 10000, 20000], compression='GZIP', file_scheme='hive')
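To read the file back with fastparquet itself, for example to verify it before sending it on, something like this should work; a minimal sketch that reuses the outfile.parq written above:
from fastparquet import ParquetFile
# open the parquet file and load it back into a pandas dataframe
pf = ParquetFile('outfile.parq')
df_roundtrip = pf.to_pandas()
print(df_roundtrip.head())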
Yes, it is possible. Here is example code:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})
table = pa.Table.from_pandas(df, preserve_index=True)
pq.write_table(table, 'output.parquet')
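Before handing the file to the other team, it can also be worth inspecting what pyarrow actually wrote, i.e. the column names, types and row-group layout; a minimal sketch against the output.parquet file from above:
import pyarrow.parquet as pq
# print the Arrow schema and the parquet file metadata
print(pq.read_schema('output.parquet'))
print(pq.read_metadata('output.parquet'))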
This is the approach that worked for me, similar to the above, but I also chose to stipulate the compression type:
Set up a test dataframe:
df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})
Convert the data frame to parquet and save it to the current directory:
df.to_parquet('df.parquet.gzip', compression='gzip')
Read the parquet file in the current directory back into a pandas data frame:
pd.read_parquet('df.parquet.gzip')
output:
col1 col2
0 1 3
1 2 4
Pandas supports parquet directly, so:
df.to_parquet('df.parquet.gzip', compression='gzip')
# this will convert the df to parquet format
df_parquet = pd.read_parquet('df.parquet.gzip')
# this will read the parquet file back into a dataframe
df_parquet.to_csv('filename.csv')
# this will convert the parquet data back to CSV
Yup, it is quite possible to write a pandas dataframe to the binary parquet format. An additional library is required, such as pyarrow or fastparquet.
import pyarrow
import pandas as pd
# read a parquet file into a pandas dataframe
df = pd.read_parquet('file_location/file_path.parquet', engine='pyarrow')
# write the dataframe back to the source file
df.to_parquet('file_location/file_path.parquet', engine='pyarrow')