Questions tagged [fastparquet]

A Python interface to the Parquet file format.

141 questions
75
votes
5 answers

A comparison between fastparquet and pyarrow?

After some searching I failed to find a thorough comparison of fastparquet and pyarrow. I found this blog post (a basic comparison of speeds) and a GitHub discussion that claims that files created with fastparquet do not support AWS Athena (btw…
moshevi
  • 4,999
  • 5
  • 33
  • 50
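Since both libraries plug into pandas as interchangeable Parquet engines, a quick head-to-head can be run on the same DataFrame. A minimal sketch (file names are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"id": range(1000), "value": [x * 0.5 for x in range(1000)]})

# Write and read back with the pyarrow engine
df.to_parquet("data_pyarrow.parquet", engine="pyarrow")
pd.read_parquet("data_pyarrow.parquet", engine="pyarrow")

# Write and read back with the fastparquet engine
df.to_parquet("data_fastparquet.parquet", engine="fastparquet")
pd.read_parquet("data_fastparquet.parquet", engine="fastparquet")
```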
63
votes
5 answers

How to read partitioned parquet files from S3 using pyarrow in python

I am looking for ways to read data from multiple partitioned directories on S3 using…
stormfield
  • 1,696
  • 1
  • 14
  • 26
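One common approach is to point a pyarrow ParquetDataset at the top-level prefix through an s3fs filesystem; partition directories (key=value) are discovered automatically. A sketch assuming a hypothetical bucket and prefix:

```python
import s3fs
import pyarrow.parquet as pq

fs = s3fs.S3FileSystem()

# Read all partitions under the prefix as one logical dataset
dataset = pq.ParquetDataset("my-bucket/path/to/partitioned/", filesystem=fs)
table = dataset.read()
df = table.to_pandas()
```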
28
votes
3 answers

Decompression 'SNAPPY' not available with fastparquet

I am trying to use fastparquet to open a file, but I get the error: RuntimeError: Decompression 'SNAPPY' not available. Options: ['GZIP', 'UNCOMPRESSED']. I have the following installed and have restarted my interpreter: python …
B. Sharp
  • 281
  • 1
  • 3
  • 6
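The usual fix is to install python-snappy (for example via `conda install python-snappy`) so fastparquet can decode SNAPPY pages; for files you control, another option is to write with a codec that is already available. A minimal sketch of the latter, with a hypothetical file name:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# GZIP is listed as available in the error message, so avoid SNAPPY entirely
df.to_parquet("data.parquet", engine="fastparquet", compression="gzip")
```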
22
votes
4 answers

pyarrow.lib.ArrowInvalid: ('Could not convert X with type Y: did not recognize Python value type when inferring an Arrow data type')

I am using pyarrow to convert a pandas.DataFrame containing Player objects to a pyarrow.Table with the following code: import pandas as pd import pyarrow as pa class Player: def __init__(self, name, age, gender): self.name = name …
Nyxynyx
  • 61,411
  • 155
  • 482
  • 830
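Arrow cannot infer a data type for a column of arbitrary Python objects, so the usual fix is to expand the objects into plain columns before conversion. A minimal sketch, with the Player class mirroring the one in the question:

```python
import pandas as pd
import pyarrow as pa

class Player:
    def __init__(self, name, age, gender):
        self.name = name
        self.age = age
        self.gender = gender

players = [Player("Alice", 30, "F"), Player("Bob", 25, "M")]

# One column per attribute instead of one column of Player objects
df = pd.DataFrame([vars(p) for p in players])
table = pa.Table.from_pandas(df)
```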
20
votes
1 answer

Does any Python library support writing arrays of structs to Parquet files?

I want to write data where some columns are arrays of strings or arrays of structs (typically key-value pairs) into a Parquet file for use in AWS Athena. After finding two Python libraries (Arrow and fastparquet) supporting writing to Parquet files…
moonhouse
  • 600
  • 3
  • 20
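pyarrow can write list-of-struct columns directly; whether Athena reads them back depends on the table DDL, which is outside this snippet. A sketch with a hypothetical key/value schema:

```python
import pyarrow as pa
import pyarrow.parquet as pq

struct_type = pa.struct([("key", pa.string()), ("value", pa.string())])
schema = pa.schema([("id", pa.int64()), ("tags", pa.list_(struct_type))])

table = pa.table(
    {
        "id": [1, 2],
        "tags": [
            [{"key": "color", "value": "red"}],
            [{"key": "color", "value": "blue"}, {"key": "size", "value": "L"}],
        ],
    },
    schema=schema,
)
pq.write_table(table, "nested.parquet")
```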
17
votes
1 answer

Is saving a HUGE dask dataframe into parquet possible?

I have a dataframe made up of 100,000+ rows, and each row has 100,000 columns, totaling 10,000,000,000 float values. I previously managed to read them in from a CSV (tab-separated) file, and I successfully loaded them onto a 50-core Xeon machine with…
alvas
  • 115,346
  • 109
  • 446
  • 738
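With dask, to_parquet writes one file per partition, so peak memory is bounded by the partition size rather than the total size. A minimal sketch, assuming the data starts as a tab-separated file with a hypothetical name and block size:

```python
import dask.dataframe as dd

# Each 256 MB block becomes one partition and, later, one Parquet file
ddf = dd.read_csv("huge_data.tsv", sep="\t", blocksize="256MB")
ddf.to_parquet("huge_data_parquet/", engine="pyarrow", write_index=False)
```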
12
votes
6 answers

Pandas dataframe type datetime64[ns] is not working in Hive/Athena

I am working on a Python application that converts a CSV file to a Hive/Athena-compatible Parquet format, and I am using the fastparquet and pandas libraries for this. There are timestamp values in the CSV file like 2018-12-21 23:45:00 which need…
prasannads
  • 609
  • 2
  • 14
  • 28
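One common workaround is to store timestamps as INT96, which older Hive/Athena readers expect; fastparquet exposes this through its `times` option. A minimal sketch with a hypothetical file name:

```python
import pandas as pd
import fastparquet

df = pd.DataFrame({"ts": pd.to_datetime(["2018-12-21 23:45:00"]), "v": [1]})

# times="int96" writes 12-byte INT96 timestamps instead of int64
fastparquet.write("events.parquet", df, times="int96")
```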
9
votes
3 answers

How to open huge parquet file using Pandas without enough RAM

I am trying to read a decently large Parquet file (~2 GB, about 30 million rows) into my Jupyter Notebook (in Python 3) using the Pandas read_parquet function. I have also installed the pyarrow and fastparquet libraries which the read_parquet…
qxzsilver
  • 522
  • 1
  • 6
  • 21
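One way to stay under the RAM limit is to stream the file in record batches, optionally restricted to the needed columns, instead of materializing everything at once. A sketch assuming a recent pyarrow; the file and column names are hypothetical:

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("big_file.parquet")
for batch in pf.iter_batches(batch_size=1_000_000, columns=["user_id", "amount"]):
    chunk = batch.to_pandas()
    # process `chunk`, then let it be garbage-collected before the next batch
```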
9
votes
2 answers

Unable to read a parquet file

I am breaking my head over this right now. I am new to Parquet files, and I am running into a LOT of issues with them. I am thrown an error that reads OSError: Passed non-file path: \datasets\proj\train\train.parquet each time I try to create a…
Anonymous Person
  • 1,437
  • 8
  • 26
  • 47
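The error usually means the backslash-separated relative path is not being resolved to a local file. Building an absolute path (or using forward slashes) typically avoids the OSError. A sketch, with the path components taken from the question:

```python
import os
import pandas as pd

# Resolve to an absolute, OS-correct path before handing it to the engine
path = os.path.abspath(os.path.join("datasets", "proj", "train", "train.parquet"))
df = pd.read_parquet(path, engine="pyarrow")
```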
8
votes
1 answer

Fastparquet giving "TypeError: expected str, bytes or os.PathLike object, not _io.BytesIO" while using dataframe.to_parquet()

I'm trying to write code for AWS Lambda to convert CSV to Parquet. I can do that using PyArrow, but it is too large (~200 MB uncompressed), which means I can't use it in the deployment package for Lambda. I'm trying to write the parquet file to…
jonsnow
  • 89
  • 4
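fastparquet's write() expects a path-like name rather than a BytesIO object, but it can write straight to S3 through s3fs via its open_with hook, which sidesteps the in-memory buffer entirely. A sketch with a hypothetical bucket and key:

```python
import pandas as pd
import s3fs
import fastparquet

df = pd.DataFrame({"a": [1, 2, 3]})   # stands in for the frame read from the CSV

s3 = s3fs.S3FileSystem()
fastparquet.write(
    "my-bucket/output/data.parquet",  # S3 key, opened through s3fs
    df,
    open_with=s3.open,
)
```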
8
votes
1 answer

filtering with dask read_parquet method gives unwanted results

I am trying to read Parquet files using the dask read_parquet method and the filters kwarg. However, it sometimes doesn't filter according to the given condition. Example: creating and saving a data frame with a dates column: import pandas as pd import…
moshevi
  • 4,999
  • 5
  • 33
  • 50
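The behaviour follows from how filters work: they can only skip whole row groups or partitions based on min/max statistics, so rows inside a retained row group still need an ordinary DataFrame filter afterwards. A sketch with a hypothetical dataset path:

```python
import pandas as pd
import dask.dataframe as dd

ddf = dd.read_parquet(
    "dates_dataset/",
    filters=[("date", ">", pd.Timestamp("2019-01-15"))],  # row-group level pruning
)
# Apply the exact predicate as well; the filters kwarg alone is not row-exact
exact = ddf[ddf["date"] > pd.Timestamp("2019-01-15")]
```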
8
votes
2 answers

pandas to_parquet fails on large datasets

I'm trying to save a very large dataset using pandas to_parquet, and it seems to fail when exceeding a certain limit, both with 'pyarrow' and 'fastparquet'. I reproduced the errors I am getting with the following code, and would be happy to hear…
kenissur
  • 171
  • 1
  • 2
  • 7
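One mitigation, assuming the frame can be split into chunks, is to write the first chunk normally and append the rest, so no single call has to serialize the whole dataset. A sketch using fastparquet's append mode with hypothetical sizes:

```python
import numpy as np
import pandas as pd
import fastparquet

df = pd.DataFrame(np.random.rand(1_000_000, 10), columns=[f"c{i}" for i in range(10)])

chunks = np.array_split(df, 10)
fastparquet.write("big.parquet", chunks[0])           # create the file
for chunk in chunks[1:]:
    fastparquet.write("big.parquet", chunk, append=True)  # add further row groups
```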
8
votes
1 answer

error with snappy while importing fastparquet in python

I have installed the following modules on my EC2 server, which already has Python (3.6) & Anaconda installed: snappy, pyarrow, s3fs, fastparquet. Except for fastparquet, everything else works on importing. When I try to import fastparquet it…
stormfield
  • 1,696
  • 1
  • 14
  • 26
7
votes
2 answers

Create Parquet files from stream in python in memory-efficient manner

It appears the most common way in Python to create Parquet files is to first create a Pandas dataframe and then use pyarrow to write the table to parquet. I worry that this might be overly taxing in memory usage - as it requires at least one full…
aaronsteers
  • 2,277
  • 2
  • 21
  • 38
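pyarrow's ParquetWriter lets you append tables or batches as they arrive from a stream, so only one batch needs to be in memory at a time. A minimal sketch with a hypothetical schema and a loop standing in for the real stream:

```python
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([("id", pa.int64()), ("value", pa.float64())])

with pq.ParquetWriter("streamed.parquet", schema) as writer:
    for start in range(0, 1_000_000, 100_000):  # stand-in for a real data stream
        table = pa.table(
            {"id": list(range(start, start + 100_000)),
             "value": [0.5] * 100_000},
            schema=schema,
        )
        writer.write_table(table)  # appended as additional row groups
```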
7
votes
1 answer

Streaming parquet file python and only downsampling

I have data in parquet format which is too big to fit into memory (6 GB). I am looking for a way to read and process the file using Python 3.6. Is there a way to stream the file, down-sample, and save to a dataframe? Ultimately, I would like to have…
Sjoseph
  • 853
  • 2
  • 14
  • 23
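One way to down-sample without loading the whole 6 GB file is to stream it batch by batch and keep only a random fraction of each batch. A sketch assuming a recent pyarrow, with hypothetical file name, batch size, and sampling fraction:

```python
import pandas as pd
import pyarrow.parquet as pq

pf = pq.ParquetFile("big_data.parquet")
samples = []
for batch in pf.iter_batches(batch_size=500_000):
    # Keep ~1% of each batch; only one batch is in memory at a time
    samples.append(batch.to_pandas().sample(frac=0.01, random_state=0))

downsampled = pd.concat(samples, ignore_index=True)
```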