75

After some searching I failed to find a thorough comparison of fastparquet and pyarrow.

I found this blog post (a basic comparison of speeds) and a GitHub discussion that claims that files created with fastparquet do not support AWS Athena (by the way, is that still the case?).

When/why would I use one over the other? What are the major advantages and disadvantages?


My specific use case is processing data with Dask, writing it to S3, and then reading/analyzing it with AWS Athena.

moshevi
  • Could be considered an "opinion" question, but there may be technical points that can make a decent answer. – mdurant Jul 16 '18 at 15:23
  • Are you trying to build a data lake using Dask instead of AWS Glue? I'm asking because I'm in the same boat. – rpanai Jul 17 '18 at 16:00
  • No, I am reading from an S3 parquet dataset, processing it, and writing it to another parquet dataset. I don't have a data-variety problem (which data lakes try to solve). – moshevi Jul 17 '18 at 18:08
  • Note that the linked benchmark has very limited scope: it presents a single data size and a single data type, so you cannot really draw any conclusions about how those tools scale or how they handle other data types. And for Python, strings are especially interesting, as they are commonly a bottleneck in many processes. – jangorecki Sep 24 '18 at 04:54

5 Answers

31

I used both fastparquet and pyarrow for converting protobuf data to parquet and for querying it in S3 using Athena. Both worked; however, in my use case, which is a Lambda function, the package zip file has to be lightweight, so I went ahead with fastparquet. (The fastparquet library was only about 1.1 MB, while the pyarrow library was 176 MB, and the Lambda package limit is 250 MB.)

I used the following to store a dataframe as a parquet file:

from fastparquet import write

# df_data is the DataFrame to store; filename is the output name without the extension
parquet_file = filename + '.parq'
write(parquet_file, df_data)
Daenerys
  • I would point out that when installing `fastparquet` I got `Downloading fastparquet-0.4.1.tar.gz (28.6 MB)` today. – moshevi Aug 25 '20 at 12:51
  • aws-data-wrangler provides pre-built layers that are optimized. They include PyArrow and are definitely the easiest way to work with Parquet in Lambda these days: https://github.com/awslabs/aws-data-wrangler – Powers Sep 11 '21 at 13:07
17

However, since the question lacks concrete criteria, and I came here looking for a good "default choice", I want to point out that the pandas default engine for DataFrame objects is pyarrow (see the pandas docs).
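
As a minimal sketch (the file names here are just placeholders), the engine can also be selected per call when writing or reading with pandas:

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# engine="auto" (the default) tries pyarrow first and falls back to fastparquet;
# either engine can also be forced explicitly:
df.to_parquet("data_pyarrow.parquet", engine="pyarrow")
df.to_parquet("data_fastparquet.parquet", engine="fastparquet")

df_back = pd.read_parquet("data_pyarrow.parquet", engine="pyarrow")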

d4tm4x
8

I would point out that the author of the speed comparison is also the author of pyarrow :) I can speak about the fastparquet case.

From your point of view, the most important thing to know is compatibility. Athena is not one of the test targets for fastparquet (or pyarrow), so you should test thoroughly before making your choice. There are a number of options that you may want to invoke (see the docs) for datetime representation, nulls, and types that may be important to you.
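
For illustration, a hedged sketch of a few of those fastparquet write options (the particular values are assumptions to adapt to your own data, not recommendations):

import pandas as pd
from fastparquet import write

df = pd.DataFrame({
    "ts": pd.to_datetime(["2018-07-16", "2018-07-17"]),
    "name": ["a", None],
})

# times="int96" stores timestamps in the legacy 12-byte layout some engines expect
# (the default is "int64"); has_nulls lists the columns that may contain nulls;
# object_encoding controls how object (string) columns are written.
write("example.parq", df,
      times="int96",
      has_nulls=["name"],
      object_encoding="utf8")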

Writing to S3 using Dask is certainly a test case for fastparquet, and I believe pyarrow should have no problem with that either.
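
A minimal sketch along those lines (the bucket name is a placeholder, AWS credentials are assumed to be configured, and the engine choice is only for illustration):

import dask.dataframe as dd
import pandas as pd

# toy DataFrame split into two partitions for Dask
ddf = dd.from_pandas(
    pd.DataFrame({"a": range(10), "b": list("abcdefghij")}),
    npartitions=2,
)

# engine can be "fastparquet" or "pyarrow"; s3fs handles the s3:// URL
ddf.to_parquet(
    "s3://my-bucket/output/",
    engine="fastparquet",
    compression="snappy",
)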

mdurant
4

I just used fastparquet for a case where I had to get data out of Elasticsearch, store it in S3, and query it with Athena, and I had no issue at all.

I used the following to store a dataframe in S3 as a parquet file:

import s3fs
import fastparquet as fp
import pandas as pd
import numpy as np

s3 = s3fs.S3FileSystem()
myopen = s3.open
s3bucket = 'mydata-aws-bucket/'

# random dataframe for demo
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))

parqKey = s3bucket + "datafile" + ".parq.snappy"
fp.write(parqKey, df, compression='SNAPPY', open_with=myopen)

My table looks similar to this in Athena:

CREATE EXTERNAL TABLE IF NOT EXISTS myanalytics_parquet (
  `column1` string,
  `column2` int,
  `column3` DOUBLE,
  `column4` int,
  `column5` string
 )
STORED AS PARQUET
LOCATION 's3://mydata-aws-bucket/'
tblproperties ("parquet.compress"="SNAPPY")
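
For completeness, a hedged sketch of running a query against such a table from Python with boto3 (the database name and result-output location are placeholders):

import boto3

athena = boto3.client("athena")

# Athena runs queries asynchronously; results land in the OutputLocation bucket
response = athena.start_query_execution(
    QueryString="SELECT column1, column2 FROM myanalytics_parquet LIMIT 10",
    QueryExecutionContext={"Database": "default"},
    ResultConfiguration={"OutputLocation": "s3://mydata-aws-bucket/athena-results/"},
)
print(response["QueryExecutionId"])
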
3

This question may be a bit old, but I happen to be working on the same issue and I found this benchmark: https://wesmckinney.com/blog/python-parquet-update/. According to it, pyarrow is faster than fastparquet, so it is little wonder that it is the default engine used in dask.

Update:

An update to my earlier response: I have had better luck writing with pyarrow and reading with fastparquet in Google Cloud Storage.
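
A minimal sketch of that cross-engine round trip (the gs:// path is a placeholder and assumes gcsfs is installed):

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

# write with pyarrow, read back with fastparquet
path = "gs://my-bucket/roundtrip.parquet"
df.to_parquet(path, engine="pyarrow")
df_back = pd.read_parquet(path, engine="fastparquet")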

  • (but, again, the author of that blog is the author of arrow) – mdurant Jul 26 '19 at 13:04
  • My use case was to read data from HBase and copy it to Azure. I used pyarrow to convert a pandas dataframe to parquet files. But when I read the parquet files from blob storage using pyarrow I faced a lot of schema-related issues even after defining the schema. Now I am using fastparquet for both reading and writing without any schema issues. – Neeraj Sharma Apr 08 '20 at 06:35
  • Isn't this the same benchmark I've linked in the question? – moshevi Aug 03 '20 at 19:15
  • pyarrow is default in pandas, fastparquet in dask – seanv507 Jan 13 '21 at 08:40
  • Pyarrow is the default for dask too: https://docs.dask.org/en/stable/generated/dask.dataframe.to_parquet.html (same for read()) – Madaray Dec 08 '22 at 13:53