A Python interface to the Parquet file format.
Questions tagged [fastparquet]
141 questions
75
votes
5 answers
A comparison between fastparquet and pyarrow?
After some searching I failed to find a thorough comparison of fastparquet and pyarrow.
I found this blog post (a basic comparison of speeds) and a GitHub discussion claiming that files created with fastparquet do not support AWS Athena (btw…

moshevi
- 4,999
- 5
- 33
- 50
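Since pandas exposes both libraries through the engine= argument of to_parquet/read_parquet, a rough comparison can be run on your own data. A minimal sketch (sizes and file names are arbitrary, and both libraries must be installed):

```python
import time

import numpy as np
import pandas as pd

# Toy frame standing in for your real data.
df = pd.DataFrame(np.random.rand(1_000_000, 10),
                  columns=[f"c{i}" for i in range(10)])

for engine in ("pyarrow", "fastparquet"):
    start = time.perf_counter()
    df.to_parquet(f"bench_{engine}.parquet", engine=engine)
    pd.read_parquet(f"bench_{engine}.parquet", engine=engine)
    print(f"{engine}: {time.perf_counter() - start:.2f}s round trip")
```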
63
votes
5 answers
How to read partitioned parquet files from S3 using pyarrow in python
I am looking for ways to read data from multiple partitioned directories on S3 using…

stormfield
- 1,696
- 1
- 14
- 26
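One approach that works is pointing pyarrow's ParquetDataset at the top-level prefix through s3fs; a sketch assuming a hypothetical bucket and a Hive-style directory layout (year=…/month=…):

```python
import pyarrow.parquet as pq
import s3fs

fs = s3fs.S3FileSystem()  # picks up AWS credentials from the environment

# Hypothetical partitioned layout: s3://my-bucket/data/year=2019/month=01/part-0.parquet
dataset = pq.ParquetDataset("my-bucket/data", filesystem=fs)
df = dataset.read().to_pandas()  # partition keys appear as ordinary columns
```

With s3fs installed, pd.read_parquet("s3://my-bucket/data") is a shorter route to the same result.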
28
votes
3 answers
Decompression 'SNAPPY' not available with fastparquet
I am trying to use fastparquet to open a file, but I get the error:
RuntimeError: Decompression 'SNAPPY' not available. Options: ['GZIP', 'UNCOMPRESSED']
I have the following installed and have restarted my interpreter:
python …

B. Sharp
- 281
- 1
- 3
- 6
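fastparquet only lists the codecs it can actually import, so the usual fix is installing python-snappy (e.g. conda install python-snappy) and restarting the interpreter. A defensive sketch that falls back to gzip when snappy is missing:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

try:
    import snappy  # provided by the python-snappy package, not the unrelated "snappy" one
    compression = "snappy"
except ImportError:
    compression = "gzip"  # codec fastparquet always supports

df.to_parquet("out.parquet", engine="fastparquet", compression=compression)
```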
22
votes
4 answers
pyarrow.lib.ArrowInvalid: ('Could not convert X with type Y: did not recognize Python value type when inferring an Arrow data type')
Using pyarrow to convert a pandas.DataFrame containing Player objects to a pyarrow.Table with the following code:
import pandas as pd
import pyarrow as pa

class Player:
    def __init__(self, name, age, gender):
        self.name = name
        …

Nyxynyx
- 61,411
- 155
- 482
- 830
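Arrow cannot infer a column type for arbitrary Python objects, so the usual workaround is flattening the objects into plain scalar columns before conversion. A minimal sketch of that idea:

```python
import pandas as pd
import pyarrow as pa

class Player:
    def __init__(self, name, age, gender):
        self.name, self.age, self.gender = name, age, gender

players = [Player("alice", 30, "f"), Player("bob", 25, "m")]

# vars(p) turns each object into a dict, giving Arrow plain strings and ints to infer.
df = pd.DataFrame([vars(p) for p in players])
table = pa.Table.from_pandas(df)
```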
20
votes
1 answer
Does any Python library support writing arrays of structs to Parquet files?
I want to write data where some columns are arrays of strings or arrays of structs (typically key-value pairs) into a Parquet file for use in AWS Athena.
After finding two Python libraries (Arrow and fastparquet) that support writing to Parquet files…

moonhouse
- 600
- 3
- 20
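pyarrow can declare list<struct<...>> columns explicitly, which sidesteps type inference entirely. A sketch writing a column of key-value pairs (column names are made up; whether Athena reads it back depends on the table DDL):

```python
import pyarrow as pa
import pyarrow.parquet as pq

kv_type = pa.list_(pa.struct([("key", pa.string()), ("value", pa.string())]))

tags = pa.array(
    [
        [{"key": "env", "value": "prod"}, {"key": "team", "value": "data"}],
        [{"key": "env", "value": "dev"}],
    ],
    type=kv_type,
)
table = pa.table({"id": pa.array([1, 2]), "tags": tags})
pq.write_table(table, "nested.parquet")
```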
17
votes
1 answer
Is saving a HUGE dask dataframe into parquet possible?
I have a dataframe made up of 100,000+ rows, and each row has 100,000 columns, totaling 10,000,000,000 float values.
I previously managed to read the data in from a tab-separated CSV file, and I successfully read it on a 50-core Xeon machine with…

alvas
- 115,346
- 109
- 446
- 738
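Since dask writes one Parquet file per partition, the dataframe never has to fit in memory at once. A sketch assuming the source is read lazily from tab-separated files (paths and sizes are placeholders):

```python
import dask.dataframe as dd

ddf = dd.read_csv("data/*.tsv", sep="\t")          # lazy, nothing loaded yet

ddf = ddf.repartition(partition_size="256MB")      # keep each output file a manageable size
ddf.to_parquet("output_parquet/", engine="pyarrow", write_index=False)
```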
12
votes
6 answers
Pandas dataframe type datetime64[ns] is not working in Hive/Athena
I am working on a Python application that converts CSV files to Hive/Athena-compatible Parquet format, using the fastparquet and pandas libraries. There are timestamp values in the CSV file like 2018-12-21 23:45:00 which need…

prasannads
- 609
- 2
- 14
- 28
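Hive and Athena expect INT96 or millisecond timestamps rather than pandas' nanosecond datetime64[ns]. A hedged sketch of the two common fixes (exact keyword support depends on the library versions installed):

```python
import pandas as pd

df = pd.read_csv("input.csv", parse_dates=["event_time"])

# fastparquet: store timestamps as INT96, which Hive/Athena understand.
df.to_parquet("out_fp.parquet", engine="fastparquet", times="int96")

# pyarrow alternative: coerce nanosecond timestamps down to milliseconds.
df.to_parquet("out_pa.parquet", engine="pyarrow",
              coerce_timestamps="ms", allow_truncated_timestamps=True)
```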
9
votes
3 answers
How to open huge parquet file using Pandas without enough RAM
I am trying to read a decently large Parquet file (~2 GB, about 30 million rows) into my Jupyter Notebook (in Python 3) using the Pandas read_parquet function. I have also installed the pyarrow and fastparquet libraries, which the read_parquet…

qxzsilver
- 522
- 1
- 6
- 21
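Two ways to stay under the RAM limit are reading only the needed columns and iterating over row groups instead of the whole file. A sketch with pyarrow (file and column names are placeholders):

```python
import pyarrow.parquet as pq

# Option 1: load only the columns you actually need.
subset = pq.read_table("big.parquet", columns=["user_id", "amount"]).to_pandas()

# Option 2: stream through the file one row group at a time.
pf = pq.ParquetFile("big.parquet")
for i in range(pf.num_row_groups):
    chunk = pf.read_row_group(i).to_pandas()
    # ...filter/aggregate chunk here and keep only the reduced result...
```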
9
votes
2 answers
Unable to read a parquet file
I am breaking my head over this right now. I am new to parquet files, and I am running into a LOT of issues with them.
I am thrown an error that reads OSError: Passed non-file path: \datasets\proj\train\train.parquet each time I try to create a…

Anonymous Person
- 1,437
- 8
- 26
- 47
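This error often comes down to pyarrow being handed a path it cannot resolve, for example backslashes or a leading separator that makes a relative path look absolute. A small sketch that normalizes the path first (the location is hypothetical):

```python
from pathlib import Path
import pandas as pd

path = Path("datasets/proj/train/train.parquet").resolve()  # absolute path, OS-correct separators
df = pd.read_parquet(path, engine="pyarrow")
```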
8
votes
1 answer
Fastparquet giving "TypeError: expected str, bytes or os.PathLike object, not _io.BytesIO" while using dataframe.to_parquet()
I'm trying to create code for AWS Lambda to convert csv to parquet. I can do that using PyArrow, but it is too large (~200 MB uncompressed), so I can't use it in the deployment package for Lambda. I'm trying to write the parquet file to…

jonsnow
- 89
- 4
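Many fastparquet versions want a real file path rather than a BytesIO object. In Lambda, a common workaround is writing to the local /tmp directory and uploading the finished file with boto3; a sketch with hypothetical bucket and key names:

```python
import boto3
import pandas as pd

def handler(event, context):
    df = pd.read_csv("s3://my-bucket/incoming/data.csv")  # requires s3fs in the package

    local_path = "/tmp/data.parquet"                      # Lambda's writable scratch space
    df.to_parquet(local_path, engine="fastparquet")

    boto3.client("s3").upload_file(local_path, "my-bucket", "converted/data.parquet")
```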
8
votes
1 answer
filtering with dask read_parquet method gives unwanted results
I am trying to read parquet files using the dask read_parquet method and the filters kwarg; however, it sometimes doesn't filter according to the given condition.
Example:
creating and saving data frame with a dates column
import pandas as pd
import…

moshevi
- 4,999
- 5
- 33
- 50
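The filters= argument prunes whole row groups or partitions using their statistics, so rows inside a surviving row group still come through; a row-level mask afterwards enforces the exact condition. A sketch (the path and column name echo the question's setup and are not verified):

```python
import dask.dataframe as dd
import pandas as pd

cutoff = pd.Timestamp("2019-01-01")

# filters= can only skip whole row groups whose min/max statistics rule them out...
ddf = dd.read_parquet("dates_data/", filters=[("dates", ">", cutoff)])

# ...so apply an ordinary mask as well to filter the remaining rows exactly.
ddf = ddf[ddf["dates"] > cutoff]
```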
8
votes
2 answers
pandas to_parquet fails on large datasets
I'm trying to save a very large dataset using pandas to_parquet, and it seems to fail when exceeding a certain limit, both with 'pyarrow' and 'fastparquet'. I reproduced the errors I am getting with the following code, and would be happy to hear…

kenissur
- 171
- 1
- 2
- 7
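One way around a single huge to_parquet call is writing the frame as several smaller files, so each write stays below whatever limit is being hit. A rough sketch (chunk size and paths are arbitrary):

```python
from pathlib import Path

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10_000_000, 5))   # stand-in for the huge frame

out_dir = Path("out")
out_dir.mkdir(exist_ok=True)

chunk_rows = 1_000_000
for i, start in enumerate(range(0, len(df), chunk_rows)):
    df.iloc[start:start + chunk_rows].to_parquet(
        out_dir / f"part_{i:04d}.parquet", engine="pyarrow"
    )
```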
8
votes
1 answer
error with snappy while importing fastparquet in python
I have installed the following modules on my EC2 server, which already has Python (3.6) & Anaconda installed:
snappy
pyarrow
s3fs
fastparquet
Everything except fastparquet imports successfully. When I try to import fastparquet it…

stormfield
- 1,696
- 1
- 14
- 26
7
votes
2 answers
Create Parquet files from stream in python in memory-efficient manner
It appears the most common way in Python to create Parquet files is to first create a Pandas dataframe and then use pyarrow to write the table to parquet. I worry that this might be overly taxing in memory usage, as it requires at least one full…

aaronsteers
- 2,277
- 2
- 21
- 38
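pyarrow's ParquetWriter can append record batches as they arrive, so only one chunk of the stream has to be in memory at a time. A sketch with a made-up generator standing in for the real stream:

```python
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([("id", pa.int64()), ("value", pa.float64())])

def stream_chunks():
    """Hypothetical source yielding small dict-of-lists chunks."""
    for i in range(100):
        yield {"id": list(range(i * 10, (i + 1) * 10)), "value": [0.5] * 10}

with pq.ParquetWriter("streamed.parquet", schema) as writer:
    for chunk in stream_chunks():
        writer.write_table(pa.table(chunk, schema=schema))  # one chunk in memory at a time
```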
7
votes
1 answer
Streaming parquet file python and only downsampling
I have data in parquet format which is too big to fit into memory (6 GB). I am looking for a way to read and process the file using Python 3.6. Is there a way to stream the file, down-sample, and save to a dataframe? Ultimately, I would like to have…

Sjoseph
- 853
- 2
- 14
- 23
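fastparquet can iterate over a file one row group at a time, which allows down-sampling each piece before the next is read. A sketch assuming a simple random sample is acceptable (file name and sampling fraction are placeholders):

```python
import pandas as pd
from fastparquet import ParquetFile

pf = ParquetFile("big_data.parquet")

sampled = []
for chunk in pf.iter_row_groups():               # one row group in memory at a time
    sampled.append(chunk.sample(frac=0.01))      # keep roughly 1% of each row group

df = pd.concat(sampled, ignore_index=True)
```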