Questions tagged [pyarrow]

pyarrow is a Python interface for Apache Arrow

About:

pyarrow provides the Python API of Apache Arrow.

Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware.


1078 questions
196 votes • 2 answers

What are the differences between feather and parquet?

Both are columnar (disk-)storage formats for use in data analysis systems. Both are integrated within Apache Arrow (pyarrow package for python) and are designed to correspond with Arrow as a columnar in-memory analytics layer. How do both formats…
Darkonaut • 20,186 • 7 • 54 • 65
76 votes • 9 answers

How to read a list of parquet files from S3 as a pandas dataframe using pyarrow?

I have a hacky way of achieving this using boto3 (1.4.4), pyarrow (0.4.1) and pandas (0.20.3). First, I can read a single parquet file locally like this: import pyarrow.parquet as pq path =…
Diego Mora Cespedes • 3,605 • 5 • 26 • 33
75 votes • 5 answers

A comparison between fastparquet and pyarrow?

After some searching I failed to find a thorough comparison of fastparquet and pyarrow. I found this blog post (a basic comparison of speeds) and a GitHub discussion that claims that files created with fastparquet do not support AWS Athena (btw…
moshevi • 4,999 • 5 • 33 • 50
63 votes • 5 answers

How to read partitioned parquet files from S3 using pyarrow in python

I am looking for ways to read data from multiple partitioned directories from s3 using…
stormfield • 1,696 • 1 • 14 • 26
52 votes • 5 answers

Using pyarrow how do you append to parquet file?

How do you append/update to a parquet file with pyarrow? import pandas as pd import pyarrow as pa import pyarrow.parquet as pq table2 = pd.DataFrame({'one': [-1, np.nan, 2.5], 'two': ['foo', 'bar', 'baz'], 'three': [True, False, True]}) table3…
Merlin • 24,552 • 41 • 131 • 206
28 votes • 5 answers

How to set/get Pandas dataframes into Redis using pyarrow

Using dd = {'ID': ['H576','H577','H578','H600', 'H700'], 'CD': ['AAAAAAA', 'BBBBB', 'CCCCCC','DDDDDD', 'EEEEEEE']} df = pd.DataFrame(dd) Pre Pandas 0.25, this below worked. set: redisConn.set("key", df.to_msgpack(compress='zlib')) get: …
Merlin • 24,552 • 41 • 131 • 206
27 votes • 4 answers

Using predicates to filter rows from pyarrow.parquet.ParquetDataset

I have a parquet dataset stored on s3, and I would like to query specific rows from the dataset. I was able to do that using petastorm but now I want to do that using only pyarrow. Here's my attempt: import pyarrow.parquet as pq import s3fs fs =…
kluu • 2,848 • 3 • 15 • 35
26 votes • 5 answers

Python pip install pyarrow error, unable to execute 'cmake'

I'm trying to install the pyarrow on a master instance of my EMR cluster, however I'm always receiving this error. [hadoop@ip-XXX-XXX-XXX-XXX ~]$ sudo /usr/bin/pip-3.4 install pyarrow Collecting pyarrow Downloading…
Yiming Wu • 611 • 1 • 5 • 11
22 votes • 4 answers

pyarrow.lib.ArrowInvalid: ('Could not convert X with type Y: did not recognize Python value type when inferring an Arrow data type')

Using pyarrow to convert a pandas.DataFrame containing Player objects to a pyarrow.Table with the following code import pandas as pd import pyarrow as pa class Player: def __init__(self, name, age, gender): self.name = name …
Nyxynyx • 61,411 • 155 • 482 • 830
22 votes • 2 answers

How to enable Apache Arrow in PySpark

I am trying to enable Apache Arrow for conversion to Pandas. I am using: pyspark 2.4.4 pyarrow 0.15.0 pandas 0.25.1 numpy 1.17.2 This is the example code spark.conf.set("spark.sql.execution.arrow.enabled", "true") x = pd.Series([1, 2, 3]) df =…
R. Lamari • 331 • 1 • 2 • 3
22 votes • 2 answers

How to write Parquet metadata with pyarrow?

I use pyarrow to create and analyse Parquet tables with biological information and I need to store some metadata, e.g. which sample the data comes from, how it was obtained and processed. Parquet seems to support file-wide metadata, but I cannot…
golobor • 1,208 • 11 • 10
21 votes • 1 answer

Feather format for long term storage since the release of apache arrow 1.0.1

From searching the issues in the Feather GitHub, as well as Stack Overflow questions such as What are the differences between feather and parquet?, I understand that the Feather format was not recommended for long-term storage due to…
Serelia • 213 • 2 • 6
20 votes • 1 answer

Does any Python library support writing arrays of structs to Parquet files?

I want to write data where some columns are arrays of strings or arrays of structs (typically key-value pairs) into a Parquet file for use in AWS Athena. After finding two Python libraries (Arrow and fastparquet) supporting writing to Parquet files…
moonhouse • 600 • 3 • 20
19 votes • 6 answers

"pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data" when sending data to BigQuery without schema

I'm working on a script where I'm sending a dataframe to BigQuery: load_job = bq_client.load_table_from_dataframe( df, '.'.join([PROJECT, DATASET, PROGRAMS_TABLE]) ) # Wait for the load job to complete return load_job.result() This is working…
Simon Breton • 2,638 • 7 • 50 • 105
19 votes • 3 answers

Overwrite parquet file with pyarrow in S3

I'm trying to overwrite my parquet files that are in S3 with pyarrow. I've checked the documentation and haven't found anything. Here is my code: from s3fs.core import S3FileSystem import pyarrow as pa import pyarrow.parquet as pq s3 =…
Mateo Rod • 544 • 2 • 6 • 14