Questions tagged [fastparquet]

A Python interface to the Parquet file format.

141 questions
75
votes
5 answers

A comparison between fastparquet and pyarrow?

After some searching I failed to find a thorough comparison of fastparquet and pyarrow. I found this blog post (a basic comparison of speeds) and a GitHub discussion that claims that files created with fastparquet do not support AWS Athena (btw…
moshevi
  • 4,999
  • 5
  • 33
  • 50
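Since both libraries plug into pandas as interchangeable Parquet engines, a quick head-to-head can be run on the same DataFrame. A minimal sketch (file names are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"id": range(1000), "value": [x * 0.5 for x in range(1000)]})

# Write and read back with the pyarrow engine
df.to_parquet("data_pyarrow.parquet", engine="pyarrow")
pd.read_parquet("data_pyarrow.parquet", engine="pyarrow")

# Write and read back with the fastparquet engine
df.to_parquet("data_fastparquet.parquet", engine="fastparquet")
pd.read_parquet("data_fastparquet.parquet", engine="fastparquet")
```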
63
votes
5 answers

How to read partitioned parquet files from S3 using pyarrow in python

I am looking for ways to read data from multiple partitioned directories on S3 using…
stormfield
  • 1,696
  • 1
  • 14
  • 26
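One common approach is to point a pyarrow ParquetDataset at the top-level prefix through an s3fs filesystem; partition directories (key=value) are discovered automatically. A sketch assuming a hypothetical bucket and prefix:

```python
import s3fs
import pyarrow.parquet as pq

fs = s3fs.S3FileSystem()

# Read all partitions under the prefix as one logical dataset
dataset = pq.ParquetDataset("my-bucket/path/to/partitioned/", filesystem=fs)
table = dataset.read()
df = table.to_pandas()
```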
28
votes
3 answers

Decompression 'SNAPPY' not available with fastparquet

I am trying to use fastparquet to open a file, but I get the error: RuntimeError: Decompression 'SNAPPY' not available. Options: ['GZIP', 'UNCOMPRESSED']. I have the following installed and have restarted my interpreter: python …
B. Sharp
  • 281
  • 1
  • 3
  • 6
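The usual fix is to install python-snappy (for example via `conda install python-snappy`) so fastparquet can decode SNAPPY pages; for files you control, another option is to write with a codec that is already available. A minimal sketch of the latter, with a hypothetical file name:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# GZIP is listed as available in the error message, so avoid SNAPPY entirely
df.to_parquet("data.parquet", engine="fastparquet", compression="gzip")
```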
22
votes
4 answers

pyarrow.lib.ArrowInvalid: ('Could not convert X with type Y: did not recognize Python value type when inferring an Arrow data type')

I am using pyarrow to convert a pandas.DataFrame containing Player objects to a pyarrow.Table with the following code: import pandas as pd import pyarrow as pa class Player: def __init__(self, name, age, gender): self.name = name …
Nyxynyx
  • 61,411
  • 155
  • 482
  • 830
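Arrow cannot infer a data type for a column of arbitrary Python objects, so the usual fix is to expand the objects into plain columns before conversion. A minimal sketch, with the Player class mirroring the one in the question:

```python
import pandas as pd
import pyarrow as pa

class Player:
    def __init__(self, name, age, gender):
        self.name = name
        self.age = age
        self.gender = gender

players = [Player("Alice", 30, "F"), Player("Bob", 25, "M")]

# One column per attribute instead of one column of Player objects
df = pd.DataFrame([vars(p) for p in players])
table = pa.Table.from_pandas(df)
```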
20
votes
1 answer

Does any Python library support writing arrays of structs to Parquet files?

I want to write data where some columns are arrays of strings or arrays of structs (typically key-value pairs) into a Parquet file for use in AWS Athena. After finding two Python libraries (Arrow and fastparquet) supporting writing to Parquet files…
moonhouse
  • 600
  • 3
  • 20
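pyarrow can write list-of-struct columns directly; whether Athena reads them back depends on the table DDL, which is outside this snippet. A sketch with a hypothetical key/value schema:

```python
import pyarrow as pa
import pyarrow.parquet as pq

struct_type = pa.struct([("key", pa.string()), ("value", pa.string())])
schema = pa.schema([("id", pa.int64()), ("tags", pa.list_(struct_type))])

table = pa.table(
    {
        "id": [1, 2],
        "tags": [
            [{"key": "color", "value": "red"}],
            [{"key": "color", "value": "blue"}, {"key": "size", "value": "L"}],
        ],
    },
    schema=schema,
)
pq.write_table(table, "nested.parquet")
```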
17
votes
1 answer

Is saving a HUGE dask dataframe into parquet possible?

I have a dataframe made up of 100,000+ rows, and each row has 100,000 columns, totaling 10,000,000,000 float values. I previously managed to read them in from a CSV (tab-separated) file, and I successfully loaded them onto a 50-core Xeon machine with…
alvas
  • 115,346
  • 109
  • 446
  • 738
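With dask, to_parquet writes one file per partition, so peak memory is bounded by the partition size rather than the total size. A minimal sketch, assuming the data starts as a tab-separated file with a hypothetical name and block size:

```python
import dask.dataframe as dd

# Each 256 MB block becomes one partition and, later, one Parquet file
ddf = dd.read_csv("huge_data.tsv", sep="\t", blocksize="256MB")
ddf.to_parquet("huge_data_parquet/", engine="pyarrow", write_index=False)
```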
12
votes
6 answers

Pandas dataframe type datetime64[ns] is not working in Hive/Athena

I am working on a Python application that converts a CSV file to a Hive/Athena-compatible Parquet format, and I am using the fastparquet and pandas libraries for this. There are timestamp values in the CSV file like 2018-12-21 23:45:00 which need…
prasannads
  • 609
  • 2
  • 14
  • 28
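One common workaround is to store timestamps as INT96, which older Hive/Athena readers expect; fastparquet exposes this through its `times` option. A minimal sketch with a hypothetical file name:

```python
import pandas as pd
import fastparquet

df = pd.DataFrame({"ts": pd.to_datetime(["2018-12-21 23:45:00"]), "v": [1]})

# times="int96" writes 12-byte INT96 timestamps instead of int64
fastparquet.write("events.parquet", df, times="int96")
```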
9
votes
3 answers

How to open huge parquet file using Pandas without enough RAM

I am trying to read a decently large Parquet file (~2 GB, about 30 million rows) into my Jupyter Notebook (in Python 3) using the Pandas read_parquet function. I have also installed the pyarrow and fastparquet libraries which the read_parquet…
qxzsilver
  • 522
  • 1
  • 6
  • 21
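One way to stay under the RAM limit is to stream the file in record batches, optionally restricted to the needed columns, instead of materializing everything at once. A sketch assuming a recent pyarrow; the file and column names are hypothetical:

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("big_file.parquet")
for batch in pf.iter_batches(batch_size=1_000_000, columns=["user_id", "amount"]):
    chunk = batch.to_pandas()
    # process `chunk`, then let it be garbage-collected before the next batch
```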
9
votes
2 answers

Unable to read a parquet file

I am breaking my head over this right now. I am new to Parquet files, and I am running into a LOT of issues with them. I am thrown an error that reads OSError: Passed non-file path: \datasets\proj\train\train.parquet each time I try to create a…
Anonymous Person
  • 1,437
  • 8
  • 26
  • 47
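The error usually means the backslash-separated relative path is not being resolved to a local file. Building an absolute path (or using forward slashes) typically avoids the OSError. A sketch, with the path components taken from the question:

```python
import os
import pandas as pd

# Resolve to an absolute, OS-correct path before handing it to the engine
path = os.path.abspath(os.path.join("datasets", "proj", "train", "train.parquet"))
df = pd.read_parquet(path, engine="pyarrow")
```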
8
votes
1 answer

Fastparquet giving "TypeError: expected str, bytes or os.PathLike object, not _io.BytesIO" while using dataframe.to_parquet()

I'm trying to write code for AWS Lambda to convert CSV to Parquet. I can do that using PyArrow, but it is too large (~200 MB uncompressed), which means I can't use it in the deployment package for Lambda. I'm trying to write the parquet file to…
jonsnow
  • 89
  • 4
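fastparquet's write() expects a path-like name rather than a BytesIO object, but it can write straight to S3 through s3fs via its open_with hook, which sidesteps the in-memory buffer entirely. A sketch with a hypothetical bucket and key:

```python
import pandas as pd
import s3fs
import fastparquet

df = pd.DataFrame({"a": [1, 2, 3]})   # stands in for the frame read from the CSV

s3 = s3fs.S3FileSystem()
fastparquet.write(
    "my-bucket/output/data.parquet",  # S3 key, opened through s3fs
    df,
    open_with=s3.open,
)
```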
8
votes
1 answer

filtering with dask read_parquet method gives unwanted results

I am trying to read Parquet files using the dask read_parquet method and the filters kwarg. However, it sometimes doesn't filter according to the given condition. Example: creating and saving a data frame with a dates column: import pandas as pd import…
moshevi
  • 4,999
  • 5
  • 33
  • 50
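The behaviour follows from how filters work: they can only skip whole row groups or partitions based on min/max statistics, so rows inside a retained row group still need an ordinary DataFrame filter afterwards. A sketch with a hypothetical dataset path:

```python
import pandas as pd
import dask.dataframe as dd

ddf = dd.read_parquet(
    "dates_dataset/",
    filters=[("date", ">", pd.Timestamp("2019-01-15"))],  # row-group level pruning
)
# Apply the exact predicate as well; the filters kwarg alone is not row-exact
exact = ddf[ddf["date"] > pd.Timestamp("2019-01-15")]
```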
8
votes
2 answers

pandas to_parquet fails on large datasets

I'm trying to save a very large dataset using pandas to_parquet, and it seems to fail when exceeding a certain limit, both with 'pyarrow' and 'fastparquet'. I reproduced the errors I am getting with the following code, and would be happy to hear…
kenissur
  • 171
  • 1
  • 2
  • 7
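One mitigation, assuming the frame can be split into chunks, is to write the first chunk normally and append the rest, so no single call has to serialize the whole dataset. A sketch using fastparquet's append mode with hypothetical sizes:

```python
import numpy as np
import pandas as pd
import fastparquet

df = pd.DataFrame(np.random.rand(1_000_000, 10), columns=[f"c{i}" for i in range(10)])

chunks = np.array_split(df, 10)
fastparquet.write("big.parquet", chunks[0])           # create the file
for chunk in chunks[1:]:
    fastparquet.write("big.parquet", chunk, append=True)  # add further row groups
```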
8
votes
1 answer

error with snappy while importing fastparquet in python

I have installed the following modules on my EC2 server, which already has Python (3.6) & Anaconda installed: snappy, pyarrow, s3fs, fastparquet. Except for fastparquet, everything else works on importing. When I try to import fastparquet it…
stormfield
  • 1,696
  • 1
  • 14
  • 26
7
votes
2 answers

Create Parquet files from stream in python in memory-efficient manner

It appears the most common way in Python to create Parquet files is to first create a Pandas dataframe and then use pyarrow to write the table to parquet. I worry that this might be overly taxing in memory usage - as it requires at least one full…
aaronsteers
  • 2,277
  • 2
  • 21
  • 38
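pyarrow's ParquetWriter lets you append tables or batches as they arrive from a stream, so only one batch needs to be in memory at a time. A minimal sketch with a hypothetical schema and a loop standing in for the real stream:

```python
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([("id", pa.int64()), ("value", pa.float64())])

with pq.ParquetWriter("streamed.parquet", schema) as writer:
    for start in range(0, 1_000_000, 100_000):  # stand-in for a real data stream
        table = pa.table(
            {"id": list(range(start, start + 100_000)),
             "value": [0.5] * 100_000},
            schema=schema,
        )
        writer.write_table(table)  # appended as additional row groups
```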
7
votes
1 answer

Streaming parquet file python and only downsampling

I have data in parquet format which is too big to fit into memory (6 GB). I am looking for a way to read and process the file using Python 3.6. Is there a way to stream the file, down-sample, and save to a dataframe? Ultimately, I would like to have…
Sjoseph
  • 853
  • 2
  • 14
  • 23
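One way to down-sample without loading the whole 6 GB file is to stream it batch by batch and keep only a random fraction of each batch. A sketch assuming a recent pyarrow, with hypothetical file name, batch size, and sampling fraction:

```python
import pandas as pd
import pyarrow.parquet as pq

pf = pq.ParquetFile("big_data.parquet")
samples = []
for batch in pf.iter_batches(batch_size=500_000):
    # Keep ~1% of each batch; only one batch is in memory at a time
    samples.append(batch.to_pandas().sample(frac=0.01, random_state=0))

downsampled = pd.concat(samples, ignore_index=True)
```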