Questions tagged [apache-arrow]

Apache Arrow™ enables execution engines to take advantage of the latest SIMD (Single Instruction, Multiple Data) operations included in modern processors, for native vectorized optimization of analytical data processing.

For installation details, see the Apache Arrow installation guide.

595 questions
146 votes · 1 answer

Difference between Apache Parquet and Arrow

I'm looking into a way to speed up my memory-intensive frontend vis app. I saw some people recommend Apache Arrow; while I'm looking into it, I'm confused about the difference between Parquet and Arrow. They are both columnar data structures.…
asked by Audrey (1,728 rep)
22 votes · 2 answers

Apache Arrow Java API Documentation

I am looking for useful documentation or examples for the Apache Arrow API. Can anyone point to some useful resources? I was only able to find some blogs and Java documentation (which doesn't say much). From what I read, it is a standard in-memory…
asked by Rijo Joseph (1,375 rep)
20 votes · 3 answers

SQL on top of Apache Arrow in-browser?

I have data that is stored in-memory in a client's browser. For example, let's say the dataset is as follows: "name" (string), "age" (int32), "isAdult" (bool) "Tom" , 29 1 "Tom" , 14 …
asked by David542 (104,438 rep)
15 votes · 5 answers

Python error using pyarrow - ArrowNotImplementedError: Support for codec 'snappy' not built

Using Python, Parquet, and Spark and running into ArrowNotImplementedError: Support for codec 'snappy' not built after upgrading to pyarrow=3.0.0. My previous version without this error was pyarrow=0.17. The error does not appear in pyarrow=1.0.1…
asked by Russell Burdt (2,391 rep)
15 votes · 6 answers

Read partitioned parquet directory (all files) into one R dataframe with Apache Arrow

How do I read a partitioned parquet file into R with arrow (without any Spark)? The situation: created parquet files with a Spark pipe and saved on S3; read with RStudio/RShiny with one column as index to do further analysis. The parquet file…
asked by Alex Ortner (1,097 rep)
13 votes · 4 answers

How to read a feather/arrow file natively?

I have a feather-format file sales.feather that I am using for exchanging data between Python and R. In R I use the following command: df = arrow::read_feather("sales.feather", as_data_frame=TRUE) In Python I used that: df =…
asked by jangorecki (16,384 rep)
13 votes · 4 answers

How to save a huge pandas dataframe to hdfs?

I'm working with pandas and with Spark dataframes. The dataframes are always very big (> 20 GB) and the standard Spark functions are not sufficient for those sizes. Currently I'm converting my pandas dataframe to a Spark dataframe like this: dataframe…
asked by Mulgard (9,877 rep)
12 votes · 1 answer

Arrow IPC vs Feather

What is the difference between Arrow IPC and Feather? The official Arrow documentation says: Version 2 (V2), the default version, which is exactly represented as the Arrow IPC file format on disk. V2 files support storing all Arrow data types as…
asked by tsorn (3,365 rep)
12 votes · 2 answers

Fastest way to construct pyarrow table row by row

I have a large dictionary that I want to iterate through to build a pyarrow table. The values of the dictionary are tuples of varying types and need to be unpacked and stored in separate columns in the final pyarrow table. I do know the schema ahead…
asked by Josh W. (1,123 rep)
11 votes · 2 answers

Reading specific partitions from a partitioned parquet dataset with pyarrow

I have a somewhat large (~20 GB) partitioned dataset in parquet format. I would like to read specific partitions from the dataset using pyarrow. I thought I could accomplish this with pyarrow.parquet.ParquetDataset, but that doesn't seem to be the…
asked by suvayu (4,271 rep)
10 votes · 1 answer

Unable to load libhdfs when using pyarrow

I'm trying to connect to HDFS through Pyarrow, but it does not work because the libhdfs library cannot be loaded. libhdfs.so is in $HADOOP_HOME/lib/native as well as in $ARROW_LIBHDFS_DIR. print(os.environ['ARROW_LIBHDFS_DIR']) fs =…
asked by Pablo Velasquez (111 rep)
10 votes · 4 answers

Spark dataframe to Arrow

I have been using Apache Arrow with Spark for a while in Python and have easily been able to convert between dataframes and Arrow objects by using Pandas as an intermediary. Recently, however, I've moved from Python to Scala for interacting with…
asked by supert165 (101 rep)
9 votes · 1 answer

Read Parquet Files using Apache Arrow

I have some Parquet files that I've written in Python using PyArrow (Apache Arrow): pyarrow.parquet.write_table(table, "example.parquet") Now I want to read these files (and preferably get an Arrow Table) using a Java program. In Python, I can…
asked by G.M (530 rep)
8 votes · 2 answers

R arrow: Error: Support for codec 'snappy' not built

I have been using the latest R arrow package (arrow_2.0.0.20201106) that supports reading and writing from AWS S3 directly (which is awesome). I don't seem to have issues when I write and read my own file (see below): write_parquet(iris,…
asked by Mike.Gahan (4,565 rep)
8 votes · 2 answers

Are data tables with more than 2^31 rows supported in R with the data.table package yet?

I am trying to do a cross join (from the original question here), and I have 500 GB of RAM. The problem is that the final data.table has more than 2^31 rows, so I get this error: Error in vecseq(f__, len__, if (allow.cartesian || notjoin ||…
asked by wolfsatthedoor (7,163 rep)