Questions tagged [pyarrow]

pyarrow is a Python interface for Apache Arrow

About:

pyarrow provides the Python API of Apache Arrow.

Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware.


1078 questions
196 votes • 2 answers

What are the differences between feather and parquet?

Both are columnar (disk-)storage formats for use in data analysis systems. Both are integrated within Apache Arrow (pyarrow package for python) and are designed to correspond with Arrow as a columnar in-memory analytics layer. How do both formats…
Darkonaut • 20,186 • 7 • 54 • 65
76 votes • 9 answers

How to read a list of parquet files from S3 as a pandas dataframe using pyarrow?

I have a hacky way of achieving this using boto3 (1.4.4), pyarrow (0.4.1) and pandas (0.20.3). First, I can read a single parquet file locally like this: import pyarrow.parquet as pq path =…
Diego Mora Cespedes • 3,605 • 5 • 26 • 33
75 votes • 5 answers

A comparison between fastparquet and pyarrow?

After some searching I failed to find a thorough comparison of fastparquet and pyarrow. I found this blog post (a basic comparison of speeds) and a GitHub discussion that claims that files created with fastparquet do not support AWS Athena (btw…
moshevi • 4,999 • 5 • 33 • 50
63 votes • 5 answers

How to read partitioned parquet files from S3 using pyarrow in python

I am looking for ways to read data from multiple partitioned directories from s3 using…
stormfield • 1,696 • 1 • 14 • 26
52 votes • 5 answers

Using pyarrow how do you append to parquet file?

How do you append/update to a parquet file with pyarrow? import pandas as pd import pyarrow as pa import pyarrow.parquet as pq table2 = pd.DataFrame({'one': [-1, np.nan, 2.5], 'two': ['foo', 'bar', 'baz'], 'three': [True, False, True]}) table3…
Merlin • 24,552 • 41 • 131 • 206
28 votes • 5 answers

How to set/get Pandas dataframes into Redis using pyarrow

Using dd = {'ID': ['H576','H577','H578','H600', 'H700'], 'CD': ['AAAAAAA', 'BBBBB', 'CCCCCC','DDDDDD', 'EEEEEEE']} df = pd.DataFrame(dd) Pre Pandas 0.25, this below worked. set: redisConn.set("key", df.to_msgpack(compress='zlib')) get: …
Merlin • 24,552 • 41 • 131 • 206
27 votes • 4 answers

Using predicates to filter rows from pyarrow.parquet.ParquetDataset

I have a parquet dataset stored on s3, and I would like to query specific rows from the dataset. I was able to do that using petastorm but now I want to do that using only pyarrow. Here's my attempt: import pyarrow.parquet as pq import s3fs fs =…
kluu • 2,848 • 3 • 15 • 35
26 votes • 5 answers

Python pip install pyarrow error, unable to execute 'cmake'

I'm trying to install the pyarrow on a master instance of my EMR cluster, however I'm always receiving this error. [hadoop@ip-XXX-XXX-XXX-XXX ~]$ sudo /usr/bin/pip-3.4 install pyarrow Collecting pyarrow Downloading…
Yiming Wu • 611 • 1 • 5 • 11
22 votes • 4 answers

pyarrow.lib.ArrowInvalid: ('Could not convert X with type Y: did not recognize Python value type when inferring an Arrow data type')

Using pyarrow to convert a pandas.DataFrame containing Player objects to a pyarrow.Table with the following code import pandas as pd import pyarrow as pa class Player: def __init__(self, name, age, gender): self.name = name …
Nyxynyx • 61,411 • 155 • 482 • 830
22 votes • 2 answers

How to enable Apache Arrow in PySpark

I am trying to enable Apache Arrow for conversion to Pandas. I am using: pyspark 2.4.4 pyarrow 0.15.0 pandas 0.25.1 numpy 1.17.2 This is the example code spark.conf.set("spark.sql.execution.arrow.enabled", "true") x = pd.Series([1, 2, 3]) df =…
R. Lamari • 331 • 1 • 2 • 3
22 votes • 2 answers

How to write Parquet metadata with pyarrow?

I use pyarrow to create and analyse Parquet tables with biological information and I need to store some metadata, e.g. which sample the data comes from, how it was obtained and processed. Parquet seems to support file-wide metadata, but I cannot…
golobor • 1,208 • 11 • 10
21 votes • 1 answer

Feather format for long term storage since the release of apache arrow 1.0.1

From searching the issues in the Feather GitHub, as well as Stack Overflow questions such as What are the differences between feather and parquet?, I understand that the Feather format was not recommended for long-term storage due to…
Serelia • 213 • 2 • 6
20 votes • 1 answer

Does any Python library support writing arrays of structs to Parquet files?

I want to write data where some columns are arrays of strings or arrays of structs (typically key-value pairs) into a Parquet file for use in AWS Athena. After finding two Python libraries (Arrow and fastparquet) supporting writing to Parquet files…
moonhouse • 600 • 3 • 20
19 votes • 6 answers

"pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data" when sending data to BigQuery without schema

I'm working on a script where I'm sending a dataframe to BigQuery: load_job = bq_client.load_table_from_dataframe( df, '.'.join([PROJECT, DATASET, PROGRAMS_TABLE]) ) # Wait for the load job to complete return load_job.result() This is working…
Simon Breton • 2,638 • 7 • 50 • 105
19 votes • 3 answers

Overwrite parquet file with pyarrow in S3

I'm trying to overwrite my parquet files that are in S3 with pyarrow. I've checked the documentation and haven't found anything. Here is my code: from s3fs.core import S3FileSystem import pyarrow as pa import pyarrow.parquet as pq s3 =…
Mateo Rod • 544 • 2 • 6 • 14