Is it possible to save a pandas data frame directly to a parquet file? If not, what would be the suggested process?
The aim is to be able to send the parquet file to another team, who can then use Scala code to read/open it. Thanks!
Pandas has a core function to_parquet(). Just write the dataframe to parquet format like this:
df.to_parquet('myfile.parquet')
You still need to install a parquet library such as fastparquet. If you have more than one parquet library installed, you also need to specify which engine you want pandas to use, otherwise it will take the first one that is installed (as described in the documentation). For example:
df.to_parquet('myfile.parquet', engine='fastparquet')
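If you want to sanity-check the file before handing it over, a quick round trip through both engines can help. This is just a minimal sketch, assuming both fastparquet and pyarrow are installed and using a small made-up dataframe:
import pandas as pd
# small example dataframe, made up just for the check
df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
# write with fastparquet, then read the same file back with pyarrow
# to confirm the file is readable outside the library that wrote it
df.to_parquet('myfile.parquet', engine='fastparquet')
roundtrip = pd.read_parquet('myfile.parquet', engine='pyarrow')
print(roundtrip)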
Assuming df is the pandas dataframe, we need to import the following libraries.
import pyarrow as pa
import pyarrow.parquet as pq
First, convert the dataframe df into a pyarrow table.
# Convert DataFrame to Apache Arrow Table
table = pa.Table.from_pandas(df)
Second, write the table into a parquet file, say file_name.parquet.
# Parquet with Snappy compression (the default)
pq.write_table(table, 'file_name.parquet')
# Parquet with GZIP compression
pq.write_table(table, 'file_name.parquet', compression='GZIP')
# Parquet with Brotli compression
pq.write_table(table, 'file_name.parquet', compression='BROTLI')
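To double-check what was written, pyarrow can also read the file straight back into a pandas dataframe; a minimal sketch, reusing the file_name.parquet written above:
import pyarrow.parquet as pq
# read the parquet file back into an Arrow table, then convert to pandas
table2 = pq.read_table('file_name.parquet')
df_roundtrip = table2.to_pandas()
print(df_roundtrip.head())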
Reference: https://tech.blueyonder.com/efficient-dataframe-storage-with-apache-parquet/
There is a relatively early implementation of a package called fastparquet - it could be a good fit for what you need.
https://github.com/dask/fastparquet
conda install -c conda-forge fastparquet
or
pip install fastparquet
from fastparquet import write
write('outfile.parq', df)
or, if you want to use some file options, like row grouping/compression:
write('outfile2.parq', df, row_group_offsets=[0, 10000, 20000], compression='GZIP', file_scheme='hive')
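To read the file back with fastparquet itself, for example to verify it before sending it on, something like this should work; a minimal sketch that reuses the outfile.parq written above:
from fastparquet import ParquetFile
# open the parquet file and load it back into a pandas dataframe
pf = ParquetFile('outfile.parq')
df_roundtrip = pf.to_pandas()
print(df_roundtrip.head())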
Yes, it is possible. Here is example code:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})
table = pa.Table.from_pandas(df, preserve_index=True)
pq.write_table(table, 'output.parquet')
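Before handing the file to the other team, it can also be worth inspecting what pyarrow actually wrote, i.e. the column names, types and row-group layout; a minimal sketch against the output.parquet file from above:
import pyarrow.parquet as pq
# print the Arrow schema and the parquet file metadata
print(pq.read_schema('output.parquet'))
print(pq.read_metadata('output.parquet'))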
This is the approach that worked for me, similar to the above, but I also chose to stipulate the compression type:
Set up a test dataframe:
df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})
Convert the data frame to parquet and save it to the current directory:
df.to_parquet('df.parquet.gzip', compression='gzip')
Read the parquet file in the current directory back into a pandas data frame:
pd.read_parquet('df.parquet.gzip')
output:
col1 col2
0 1 3
1 2 4
Pandas supports parquet directly, so:
df.to_parquet('df.parquet.gzip', compression='gzip')
# this will convert the df to parquet format
df_parquet = pd.read_parquet('df.parquet.gzip')
# this will read the parquet file back into a dataframe
df_parquet.to_csv('filename.csv')
# this will convert the parquet data back to CSV
Yup, it is quite possible to write a pandas dataframe to the binary parquet format. An additional library is required, such as pyarrow or fastparquet.
import pyarrow
import pandas as pd
# read a parquet file into a pandas dataframe
df = pd.read_parquet('file_location/file_path.parquet', engine='pyarrow')
# write the dataframe back to the source file
df.to_parquet('file_location/file_path.parquet', engine='pyarrow')