Questions tagged [apache-arrow]

Apache Arrow™ enables execution engines to take advantage of the latest SIMD (Single Instruction, Multiple Data) operations included in modern processors, for native vectorized optimization of analytical data processing.

For installation details, see the Apache Arrow installation guide.

595 questions
146 votes · 1 answer

Difference between Apache Parquet and Arrow

I'm looking into a way to speed up my memory-intensive frontend vis app. I saw some people recommend Apache Arrow; while I'm looking into it, I'm confused about the difference between Parquet and Arrow. They are both columnar data structures.…
asked by Audrey (1,728 rep)
22 votes · 2 answers

Apache Arrow Java API Documentation

I am looking for useful documentation or examples for the Apache Arrow API. Can anyone point to some useful resources? I was only able to find some blogs and Java documentation (which doesn't say much). From what I read, it is a standard in-memory…
asked by Rijo Joseph (1,375 rep)
20 votes · 3 answers

SQL on top of Apache Arrow in-browser?

I have data that is stored in-memory in a client's browser. For example, let's say the dataset is as follows: "name" (string), "age" (int32), "isAdult" (bool) "Tom" , 29 1 "Tom" , 14 …
asked by David542 (104,438 rep)
15 votes · 5 answers

Python error using pyarrow - ArrowNotImplementedError: Support for codec 'snappy' not built

Using Python, Parquet, and Spark and running into ArrowNotImplementedError: Support for codec 'snappy' not built after upgrading to pyarrow=3.0.0. My previous version without this error was pyarrow=0.17. The error does not appear in pyarrow=1.0.1…
asked by Russell Burdt (2,391 rep)
15 votes · 6 answers

Read partitioned parquet directory (all files) into one R dataframe with Apache Arrow

How do I read a partitioned parquet file into R with arrow (without any Spark)? The situation: created parquet files with a Spark pipe and saved on S3; read with RStudio/RShiny with one column as index to do further analysis. The parquet file…
asked by Alex Ortner (1,097 rep)
13 votes · 4 answers

How to read a feather/arrow file natively?

I have a feather-format file sales.feather that I am using for exchanging data between Python and R. In R I use the following command: df = arrow::read_feather("sales.feather", as_data_frame=TRUE) In Python I used that: df =…
asked by jangorecki (16,384 rep)
13 votes · 4 answers

How to save a huge pandas dataframe to hdfs?

I'm working with pandas and with Spark dataframes. The dataframes are always very big (> 20 GB) and the standard Spark functions are not sufficient for those sizes. Currently I'm converting my pandas dataframe to a Spark dataframe like this: dataframe…
asked by Mulgard (9,877 rep)
12 votes · 1 answer

Arrow IPC vs Feather

What is the difference between Arrow IPC and Feather? The official Arrow documentation says: Version 2 (V2), the default version, which is exactly represented as the Arrow IPC file format on disk. V2 files support storing all Arrow data types as…
asked by tsorn (3,365 rep)
12 votes · 2 answers

Fastest way to construct pyarrow table row by row

I have a large dictionary that I want to iterate through to build a pyarrow table. The values of the dictionary are tuples of varying types and need to be unpacked and stored in separate columns in the final pyarrow table. I do know the schema ahead…
asked by Josh W. (1,123 rep)
11 votes · 2 answers

Reading specific partitions from a partitioned parquet dataset with pyarrow

I have a somewhat large (~20 GB) partitioned dataset in parquet format. I would like to read specific partitions from the dataset using pyarrow. I thought I could accomplish this with pyarrow.parquet.ParquetDataset, but that doesn't seem to be the…
asked by suvayu (4,271 rep)
10 votes · 1 answer

Unable to load libhdfs when using pyarrow

I'm trying to connect to HDFS through Pyarrow, but it does not work because the libhdfs library cannot be loaded. libhdfs.so is in $HADOOP_HOME/lib/native as well as in $ARROW_LIBHDFS_DIR. print(os.environ['ARROW_LIBHDFS_DIR']) fs =…
asked by Pablo Velasquez (111 rep)
10 votes · 4 answers

Spark dataframe to Arrow

I have been using Apache Arrow with Spark for a while in Python and have easily been able to convert between dataframes and Arrow objects by using Pandas as an intermediary. Recently, however, I've moved from Python to Scala for interacting with…
asked by supert165 (101 rep)
9 votes · 1 answer

Read Parquet Files using Apache Arrow

I have some Parquet files that I've written in Python using PyArrow (Apache Arrow): pyarrow.parquet.write_table(table, "example.parquet") Now I want to read these files (and preferably get an Arrow Table) using a Java program. In Python, I can…
asked by G.M (530 rep)
8 votes · 2 answers

R arrow: Error: Support for codec 'snappy' not built

I have been using the latest R arrow package (arrow_2.0.0.20201106) that supports reading and writing from AWS S3 directly (which is awesome). I don't seem to have issues when I write and read my own file (see below): write_parquet(iris,…
asked by Mike.Gahan (4,565 rep)
8 votes · 2 answers

Are data tables with more than 2^31 rows supported in R with the data.table package yet?

I am trying to do a cross join (from the original question here), and I have 500 GB of RAM. The problem is that the final data.table has more than 2^31 rows, so I get this error: Error in vecseq(f__, len__, if (allow.cartesian || notjoin ||…
asked by wolfsatthedoor (7,163 rep)