164

How do I read a modestly sized Parquet data set into an in-memory Pandas DataFrame without setting up cluster-computing infrastructure such as Hadoop or Spark? It is only a moderate amount of data that I would like to read in memory with a simple Python script on a laptop. The data does not reside on HDFS; it is either on the local file system or possibly in S3. I do not want to spin up and configure other services like Hadoop, Hive or Spark.

I thought Blaze/Odo would have made this possible: the Odo documentation mentions Parquet, but the examples all seem to go through an external Hive runtime.

Gonçalo Peres
Daniel Mahler

  • Do you happen to have the data openly available? My branch of python-parquet https://github.com/martindurant/parquet-python/tree/py3 had a pandas reader in parquet.rparquet, you could try it. There are many parquet constructs it cannot handle. – mdurant Nov 19 '15 at 21:21
  • Wait for the Apache Arrow project that the Pandas author Wes McKinney is part of. http://wesmckinney.com/blog/pandas-and-apache-arrow/ After it is done, users should be able to read Parquet files directly from Pandas. – XValidated Apr 09 '16 at 00:36
  • Since the question is closed as off-topic (but still the first result on Google) I have to answer in a comment. You can now use pyarrow to read a parquet file and convert it to a pandas DataFrame: `import pyarrow.parquet as pq; df = pq.read_table('dataset.parq').to_pandas()` – sroecker May 27 '17 at 11:34
  • Kinda annoyed that this question was closed. Spark and parquet are (still) relatively poorly documented. Am also looking for the answer to this. – user48956 Jul 06 '17 at 16:40
  • Have a look at https://github.com/dask/fastparquet . For an introduction see https://www.continuum.io/blog/developer-blog/introducing-fastparquet . – asmaier Aug 22 '17 at 15:25
  • Both the fastparquet and pyarrow libraries make it possible to read a parquet file into a pandas dataframe: https://github.com/dask/fastparquet and https://arrow.apache.org/docs/python/parquet.html – ogrisel Oct 11 '17 at 09:07
  • @DanielMahler consider updating the accepted answer – MichaelChirico Dec 01 '17 at 11:04

8 Answers

202

pandas 0.21 introduces new functions for Parquet:

import pandas as pd
pd.read_parquet('example_pa.parquet', engine='pyarrow')

or

import pandas as pd
pd.read_parquet('example_fp.parquet', engine='fastparquet')

The pandas documentation explains:

These engines are very similar and should read/write nearly identical parquet format files. These libraries differ by having different underlying dependencies (fastparquet by using numba, while pyarrow uses a c-library).
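For writing there is a matching DataFrame.to_parquet that takes the same engine keyword. A minimal round-trip sketch (the DataFrame contents and file names here are just examples):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})

# Write with whichever engine you have installed; both produce standard Parquet files.
df.to_parquet('example_pa.parquet', engine='pyarrow')
df.to_parquet('example_fp.parquet', engine='fastparquet')

# Either engine can read a file written by the other.
pd.read_parquet('example_pa.parquet', engine='fastparquet')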

Zags
chrisaycock

  • For most of my data, 'fastparquet' is a bit faster. Just in case `pd.read_parquet()` returns a problem with Snappy Error, run `conda install python-snappy` to install snappy. – Chau Pham Oct 17 '18 at 04:27
  • I found pyarrow to be too difficult to install (both on my local windows machine and on a cloud linux machine). Even after the python-snappy fix, there were additional issues with the compiler as well as the error module 'pyarrow' has no attribute 'compat'. fastparquet had no issues at all. – Seb Feb 21 '19 at 16:11
  • @Catbuilts You can use gzip if you don't have snappy. – Khan Jun 19 '19 at 17:00
  • can 'fastparquet' read '.snappy.parquet' file? – wawawa Dec 02 '20 at 11:31
  • I had the opposite experience vs. @Seb. fastparquet had a bunch of issues, pyarrow was simple pip install and off I went – Mark Z. Apr 02 '21 at 04:34
20

Update: since I answered this there has been a lot of work in this area; look at Apache Arrow for better reading and writing of parquet. Also: http://wesmckinney.com/blog/python-parquet-multithreading/

There is a python parquet reader that works relatively well: https://github.com/jcrobak/parquet-python

It will create Python objects, which you then have to move into a Pandas DataFrame, so the process will be slower than pd.read_csv, for example.
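As a rough sketch of that pattern (treat this as an assumption rather than a tested recipe: the file and column names are placeholders, and parquet-python's reader API may have changed, so check the project's README), the rows the reader yields can be collected into a DataFrame:

import pandas as pd
import parquet  # the parquet-python package

# Placeholder file and column names; parquet-python exposes a DictReader-style
# interface that yields one dict per row.
with open('example.parquet', 'rb') as fo:
    rows = list(parquet.DictReader(fo, columns=['one', 'two']))

df = pd.DataFrame(rows)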

danielfrg

  • Actually there is pyarrow which allows both reads / writes: http://pyarrow.readthedocs.io/en/latest/parquet.html – bluszcz Jan 27 '17 at 12:54
  • I get a permission denied error when I try to follow your link, @bluszcz -- do you have an alternate? – snooze_bear May 16 '17 at 13:50
  • parquet-python is much slower than alternatives such as fastparquet and pyarrow: https://arrow.apache.org/docs/python/parquet.html – ogrisel Oct 11 '17 at 09:10
  • `pd.read_parquet` is now part of pandas. The other answer should be marked as valid. – ogrisel Nov 03 '17 at 07:41
15

Aside from pandas, Apache pyarrow also provides a way to transform parquet into a dataframe.

The code is simple, just type:

import pyarrow.parquet as pq

df = pq.read_table(source=your_file_path).to_pandas()

For more information, see the Apache pyarrow documentation on Reading and Writing Single Files.
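One useful detail from those docs: read_table can load just a subset of columns, which keeps memory down for wide files. A small sketch (file and column names are hypothetical):

import pyarrow.parquet as pq

# Only the listed columns are read, which saves memory and I/O.
table = pq.read_table('your_file.parquet', columns=['col_a', 'col_b'])
df = table.to_pandas()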

WY Hsu
12

Parquet

Step 1: Data to play with

import pandas as pd

df = pd.DataFrame({
    'student': ['personA007', 'personB', 'x', 'personD', 'personE'],
    'marks': [20, 10, 22, 21, 22],
})

Step 2: Save as Parquet

df.to_parquet('sample.parquet')

Step 3: Read from Parquet

df = pd.read_parquet('sample.parquet')
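If you only need some of the columns, pd.read_parquet also accepts a columns argument; for example, to load just the marks column from the file above:

import pandas as pd

marks = pd.read_parquet('sample.parquet', columns=['marks'])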
Harish Masand
4

When writing to parquet, consider using brotli compression. I'm getting a 70% size reduction on an 8GB parquet file by using brotli compression. Brotli makes for a smaller file and faster reads/writes than gzip, snappy, or pickle, although pickle can handle tuples whereas parquet cannot.

df.to_parquet('df.parquet.brotli', compression='brotli')
df = pd.read_parquet('df.parquet.brotli')
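The ratio depends heavily on your data, so it is worth measuring on your own DataFrame. A quick sketch for comparing codecs (assuming your parquet engine was built with brotli support; the DataFrame here is a made-up example):

import os
import pandas as pd

df = pd.DataFrame({'x': range(100_000), 'y': ['some repetitive text'] * 100_000})

for codec in ['snappy', 'gzip', 'brotli']:
    path = f'df.parquet.{codec}'
    df.to_parquet(path, compression=codec)
    print(codec, os.path.getsize(path), 'bytes')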
BSalita
1

Parquet files are often large, so you can read them using dask.

import dask.dataframe as dd
from dask import delayed
from fastparquet import ParquetFile
import glob

files = glob.glob('data/*.parquet')

@delayed
def load_chunk(path):
    # Each delayed task reads one parquet file into a pandas DataFrame.
    return ParquetFile(path).to_pandas()

# Build a lazy dask DataFrame from the per-file chunks.
df = dd.from_delayed([load_chunk(f) for f in files])

# Trigger the reads and materialise an in-memory pandas DataFrame.
pandas_df = df.compute()
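Note that dask can also read a directory of parquet files directly, which may be simpler than building the delayed graph by hand (the glob path mirrors the example above):

import dask.dataframe as dd

# dask handles file discovery and parallel reads itself.
ddf = dd.read_parquet('data/*.parquet')
pandas_df = ddf.compute()  # materialise as an in-memory pandas DataFrame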
RaaHul Dutta
1

Considering a .parquet file named data.parquet

parquet_file = '../data.parquet'

Convert to Parquet

Assuming one has a dataframe parquet_df that one wants to save to the parquet file above, one can use DataFrame.to_parquet (this function requires either the fastparquet or pyarrow library) as follows:

parquet_df.to_parquet(parquet_file)

Read from Parquet

In order to read the parquet file into a dataframe new_parquet_df, one can use pandas.read_parquet() as follows

new_parquet_df = pd.read_parquet(parquet_file)
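As an optional sanity check (not required by the steps above), pandas' testing helpers can confirm the round trip preserved the data, at least for simple dtypes:

import pandas as pd

parquet_df = pd.DataFrame({'a': [1, 2, 3]})
parquet_df.to_parquet('../data.parquet')
new_parquet_df = pd.read_parquet('../data.parquet')

# Raises an AssertionError if anything changed in the round trip.
pd.testing.assert_frame_equal(parquet_df, new_parquet_df)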
Gonçalo Peres
0

You can use Python to read parquet data:

1. Install the packages: pip install pandas pyarrow

2. Read the file:

import pandas as pd

def read_parquet(file):
    result = []
    data = pd.read_parquet(file)
    for index in data.index:
        # Take every value in the row except the last column.
        res = data.loc[index].values[0:-1]
        result.append(res)
    print(len(result))


file = "./data.parquet"
read_parquet(file)
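Row-by-row .loc access gets slow on large files; a vectorised version of the same idea (dropping the last column, as the [0:-1] slice does) could look like this:

import pandas as pd

def read_parquet_fast(file):
    data = pd.read_parquet(file)
    # Drop the last column and turn each remaining row into a list of values.
    result = data.iloc[:, :-1].to_numpy().tolist()
    print(len(result))
    return result

read_parquet_fast("./data.parquet")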
Roshin Raphel
Wollens