164

How do I read a modestly sized Parquet data set into an in-memory Pandas DataFrame without setting up cluster-computing infrastructure such as Hadoop or Spark? It is only a moderate amount of data that I would like to read in memory with a simple Python script on a laptop. The data does not reside on HDFS; it is either on the local file system or possibly in S3. I do not want to spin up and configure other services like Hadoop, Hive or Spark.

I thought Blaze/Odo would have made this possible: the Odo documentation mentions Parquet, but the examples all seem to go through an external Hive runtime.

Gonçalo Peres
Daniel Mahler

  • Do you happen to have the data openly available? My branch of python-parquet https://github.com/martindurant/parquet-python/tree/py3 had a pandas reader in parquet.rparquet, you could try it. There are many parquet constructs it cannot handle. – mdurant Nov 19 '15 at 21:21
  • Wait for the Apache Arrow project that the Pandas author Wes McKinney is part of. http://wesmckinney.com/blog/pandas-and-apache-arrow/ After it is done, users should be able to read Parquet files directly from Pandas. – XValidated Apr 09 '16 at 00:36
  • Since the question is closed as off-topic (but still the first result on Google) I have to answer in a comment. You can now use pyarrow to read a parquet file and convert it to a pandas DataFrame: `import pyarrow.parquet as pq; df = pq.read_table('dataset.parq').to_pandas()` – sroecker May 27 '17 at 11:34
  • Kinda annoyed that this question was closed. Spark and parquet are (still) relatively poorly documented. Am also looking for the answer to this. – user48956 Jul 06 '17 at 16:40
  • Have a look at https://github.com/dask/fastparquet . For an introduction see https://www.continuum.io/blog/developer-blog/introducing-fastparquet . – asmaier Aug 22 '17 at 15:25
  • Both the fastparquet and pyarrow libraries make it possible to read a parquet file into a pandas dataframe: https://github.com/dask/fastparquet and https://arrow.apache.org/docs/python/parquet.html – ogrisel Oct 11 '17 at 09:07
  • @DanielMahler consider updating the accepted answer – MichaelChirico Dec 01 '17 at 11:04

8 Answers

202

pandas 0.21 introduces new functions for Parquet:

import pandas as pd
pd.read_parquet('example_pa.parquet', engine='pyarrow')

or

import pandas as pd
pd.read_parquet('example_fp.parquet', engine='fastparquet')

The pandas documentation explains:

These engines are very similar and should read/write nearly identical parquet format files. These libraries differ by having different underlying dependencies (fastparquet by using numba, while pyarrow uses a c-library).
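For writing there is a matching DataFrame.to_parquet that takes the same engine keyword. A minimal round-trip sketch (the DataFrame contents and file names here are just examples):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})

# Write with whichever engine you have installed; both produce standard Parquet files.
df.to_parquet('example_pa.parquet', engine='pyarrow')
df.to_parquet('example_fp.parquet', engine='fastparquet')

# Either engine can read a file written by the other.
pd.read_parquet('example_pa.parquet', engine='fastparquet')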

Zags
chrisaycock

  • For most of my data, 'fastparquet' is a bit faster. Just in case `pd.read_parquet()` returns a problem with Snappy Error, run `conda install python-snappy` to install snappy. – Chau Pham Oct 17 '18 at 04:27
  • I found pyarrow to be too difficult to install (both on my local windows machine and on a cloud linux machine). Even after the python-snappy fix, there were additional issues with the compiler as well as the error module 'pyarrow' has no attribute 'compat'. fastparquet had no issues at all. – Seb Feb 21 '19 at 16:11
  • @Catbuilts You can use gzip if you don't have snappy. – Khan Jun 19 '19 at 17:00
  • can 'fastparquet' read '.snappy.parquet' file? – wawawa Dec 02 '20 at 11:31
  • I had the opposite experience vs. @Seb. fastparquet had a bunch of issues, pyarrow was simple pip install and off I went – Mark Z. Apr 02 '21 at 04:34
20

Update: since I answered this there has been a lot of work in this area; look at Apache Arrow for better reading and writing of parquet. Also: http://wesmckinney.com/blog/python-parquet-multithreading/

There is a python parquet reader that works relatively well: https://github.com/jcrobak/parquet-python

It will create Python objects, which you then have to move into a Pandas DataFrame, so the process will be slower than pd.read_csv, for example.
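As a rough sketch of that pattern (treat this as an assumption rather than a tested recipe: the file and column names are placeholders, and parquet-python's reader API may have changed, so check the project's README), the rows the reader yields can be collected into a DataFrame:

import pandas as pd
import parquet  # the parquet-python package

# Placeholder file and column names; parquet-python exposes a DictReader-style
# interface that yields one dict per row.
with open('example.parquet', 'rb') as fo:
    rows = list(parquet.DictReader(fo, columns=['one', 'two']))

df = pd.DataFrame(rows)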

danielfrg

  • Actually there is pyarrow which allows both reads / writes: http://pyarrow.readthedocs.io/en/latest/parquet.html – bluszcz Jan 27 '17 at 12:54
  • I get a permission denied error when I try to follow your link, @bluszcz -- do you have an alternate? – snooze_bear May 16 '17 at 13:50
  • parquet-python is much slower than alternatives such as fastparquet and pyarrow: https://arrow.apache.org/docs/python/parquet.html – ogrisel Oct 11 '17 at 09:10
  • `pd.read_parquet` is now part of pandas. The other answer should be marked as valid. – ogrisel Nov 03 '17 at 07:41
15

Aside from pandas, Apache pyarrow also provides a way to transform parquet into a dataframe.

The code is simple, just type:

import pyarrow.parquet as pq

df = pq.read_table(source=your_file_path).to_pandas()

For more information, see the Apache pyarrow documentation on Reading and Writing Single Files.
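One useful detail from those docs: read_table can load just a subset of columns, which keeps memory down for wide files. A small sketch (file and column names are hypothetical):

import pyarrow.parquet as pq

# Only the listed columns are read, which saves memory and I/O.
table = pq.read_table('your_file.parquet', columns=['col_a', 'col_b'])
df = table.to_pandas()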

WY Hsu
12

Parquet

Step 1: Data to play with

import pandas as pd

df = pd.DataFrame({
    'student': ['personA007', 'personB', 'x', 'personD', 'personE'],
    'marks': [20, 10, 22, 21, 22],
})

Step 2: Save as Parquet

df.to_parquet('sample.parquet')

Step 3: Read from Parquet

df = pd.read_parquet('sample.parquet')
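If you only need some of the columns, pd.read_parquet also accepts a columns argument; for example, to load just the marks column from the file above:

import pandas as pd

marks = pd.read_parquet('sample.parquet', columns=['marks'])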
Harish Masand
4

When writing to parquet, consider using brotli compression. I'm getting a 70% size reduction on an 8GB parquet file by using brotli compression. Brotli makes for a smaller file and faster reads/writes than gzip, snappy, or pickle, although pickle can handle tuples whereas parquet cannot.

df.to_parquet('df.parquet.brotli', compression='brotli')
df = pd.read_parquet('df.parquet.brotli')
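The ratio depends heavily on your data, so it is worth measuring on your own DataFrame. A quick sketch for comparing codecs (assuming your parquet engine was built with brotli support; the DataFrame here is a made-up example):

import os
import pandas as pd

df = pd.DataFrame({'x': range(100_000), 'y': ['some repetitive text'] * 100_000})

for codec in ['snappy', 'gzip', 'brotli']:
    path = f'df.parquet.{codec}'
    df.to_parquet(path, compression=codec)
    print(codec, os.path.getsize(path), 'bytes')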
BSalita
1

Parquet files are often large, so you can read them using dask.

import dask.dataframe as dd
from dask import delayed
from fastparquet import ParquetFile
import glob

files = glob.glob('data/*.parquet')

@delayed
def load_chunk(path):
    # Each delayed task reads one parquet file into a pandas DataFrame.
    return ParquetFile(path).to_pandas()

# Build a lazy dask DataFrame from the per-file chunks.
df = dd.from_delayed([load_chunk(f) for f in files])

# Trigger the reads and materialise an in-memory pandas DataFrame.
pandas_df = df.compute()
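Note that dask can also read a directory of parquet files directly, which may be simpler than building the delayed graph by hand (the glob path mirrors the example above):

import dask.dataframe as dd

# dask handles file discovery and parallel reads itself.
ddf = dd.read_parquet('data/*.parquet')
pandas_df = ddf.compute()  # materialise as an in-memory pandas DataFrame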
RaaHul Dutta
1

Considering a .parquet file named data.parquet

parquet_file = '../data.parquet'

Convert to Parquet

Assuming one has a dataframe parquet_df that one wants to save to the parquet file above, one can use DataFrame.to_parquet (this function requires either the fastparquet or pyarrow library) as follows:

parquet_df.to_parquet(parquet_file)

Read from Parquet

In order to read the parquet file into a dataframe new_parquet_df, one can use pandas.read_parquet() as follows

new_parquet_df = pd.read_parquet(parquet_file)
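As an optional sanity check (not required by the steps above), pandas' testing helpers can confirm the round trip preserved the data, at least for simple dtypes:

import pandas as pd

parquet_df = pd.DataFrame({'a': [1, 2, 3]})
parquet_df.to_parquet('../data.parquet')
new_parquet_df = pd.read_parquet('../data.parquet')

# Raises an AssertionError if anything changed in the round trip.
pd.testing.assert_frame_equal(parquet_df, new_parquet_df)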
Gonçalo Peres
0

You can use Python to read parquet data:

1. Install the packages: pip install pandas pyarrow

2. Read the file:

import pandas as pd

def read_parquet(file):
    result = []
    data = pd.read_parquet(file)
    for index in data.index:
        # Take every value in the row except the last column.
        res = data.loc[index].values[0:-1]
        result.append(res)
    print(len(result))


file = "./data.parquet"
read_parquet(file)
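Row-by-row .loc access gets slow on large files; a vectorised version of the same idea (dropping the last column, as the [0:-1] slice does) could look like this:

import pandas as pd

def read_parquet_fast(file):
    data = pd.read_parquet(file)
    # Drop the last column and turn each remaining row into a list of values.
    result = data.iloc[:, :-1].to_numpy().tolist()
    print(len(result))
    return result

read_parquet_fast("./data.parquet")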
Roshin Raphel
Wollens