10

I have been using Apache Arrow with Spark in Python for a while and have easily been able to convert between DataFrames and Arrow objects by using Pandas as an intermediary.
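(For reference, the Python workflow I mean is roughly the following, where `spark_df` stands for an existing Spark DataFrame:)

```python
# Rough sketch of the Python-side workflow: Spark DataFrame -> Pandas
# DataFrame -> Arrow Table, with Pandas as the intermediary.
# Assumes `spark_df` is an existing Spark DataFrame.
import pyarrow as pa

pdf = spark_df.toPandas()          # Spark DataFrame -> Pandas DataFrame
table = pa.Table.from_pandas(pdf)  # Pandas DataFrame -> Arrow Table
```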

Recently, however, I've moved from Python to Scala for interacting with Spark, and using Arrow isn't as intuitive in Scala (Java) as it is in Python. My basic need is to convert a Spark DataFrame (or RDD, since they're easily convertible) to an Arrow object as quickly as possible. My initial thought was to convert to Parquet first and go from Parquet to Arrow, since I remembered that pyarrow can read from Parquet. However, and please correct me if I'm wrong, after looking at the Arrow Java docs for a while I couldn't find a Parquet-to-Arrow function. Does this function not exist in the Java version? Is there another way to get a Spark DataFrame to an Arrow object? Perhaps converting the DataFrame's columns to arrays and then converting those to Arrow objects?

Any help would be much appreciated. Thank you.

EDIT: I found the following link, which converts a Parquet schema to an Arrow schema, but it doesn't seem to return an Arrow object from a Parquet file like I need: https://github.com/apache/parquet-mr/blob/70f28810a5547219e18ffc3465f519c454fee6e5/parquet-arrow/src/main/java/org/apache/parquet/arrow/schema/SchemaConverter.java

Ram Ghadiyaram
supert165
    Wes McKinney is one of the best people [IMHO] to answer this question. I tweeted him (https://twitter.com/gstaubli/status/895763929653157888) in the hopes of getting a response. Fingers crossed. – Garren S Aug 10 '17 at 21:56

4 Answers

5

There is not a Parquet <-> Arrow converter available as a library in Java yet. You could have a look at the Arrow-based Parquet converter in Dremio (https://github.com/dremio/dremio-oss/tree/master/sabot/kernel/src/main/java/com/dremio/exec/store/parquet) for inspiration. I am sure the Apache Parquet project would welcome your contribution implementing this functionality.

We have developed an Arrow reader/writer for Parquet in the C++ implementation: https://github.com/apache/parquet-cpp/tree/master/src/parquet/arrow. Nested data support is not complete yet, but it should be more complete within the next 6-12 months (sooner if contributors step up).
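(That C++ reader is what pyarrow exposes, so for reference a minimal sketch of reading Parquet into an Arrow Table via pyarrow, assuming a local file `data.parquet` exists, looks like this:)

```python
# Minimal sketch: read a Parquet file into an Arrow Table using pyarrow,
# which wraps the parquet-cpp Arrow reader mentioned above.
# Assumes a local file "data.parquet" exists.
import pyarrow.parquet as pq

table = pq.read_table("data.parquet")  # pyarrow.Table in Arrow columnar format
print(table.schema)
print(table.num_rows)
```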

Wes McKinney
  • Sorry for the side question but trying to understand how the performance benefits of `Apache Arrow` are obtained with the Java implementation. Looking at https://github.com/apache/arrow/tree/master/java/memory/src/main/java/org/apache/arrow/memory and https://github.com/apache/arrow/tree/master/cpp/src/arrow/python makes me think that `arrow-cpp` is strictly for Python and not to be used with Java/JVM. Is this correct, Wes? – SemanticBeeng Jan 12 '18 at 07:28
  • is this in the works or in the queue of being worked on? – kgui Aug 14 '18 at 18:33
  • There has been some discussion on the mailing list. Note that open source projects don't really have "queues" -- if you really want something you often have to build it yourself, or wait for time to pass until someone else does. – Wes McKinney Aug 15 '18 at 19:31
3

Now there's an answer: Arrow can be used to convert Spark DataFrames to Pandas DataFrames, and when calling Pandas UDFs. Please see the PySpark SQL "Pandas with Arrow" documentation page.
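As a minimal sketch, enabling Arrow-based conversion and calling `toPandas()` looks like this (the config key shown is the Spark 2.3/2.4 name; Spark 3.x renamed it to `spark.sql.execution.arrow.pyspark.enabled`):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Enable Arrow-based columnar data transfers (Spark 2.3/2.4 key; on
# Spark 3.x use "spark.sql.execution.arrow.pyspark.enabled").
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

df = spark.range(0, 1000)
pdf = df.toPandas()  # uses Arrow for the JVM -> Python transfer when enabled
```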

Douglas M
1

Spark 3.3 will have a mapInArrow API call, similar to the already existing mapInPandas API call.

Here's the first PR that adds this to Python: https://github.com/apache/spark/pull/34505

There should also be a similar Spark Scala API call by the time 3.3 releases.

I'm not sure exactly what your use case is, but this seems like it may help. A sketch of the call is shown below.

PS. Note that this API is initially planned as developer-level, as working with Arrow may not be very user-friendly at first. It may be great if you're developing a library on top of Spark/Arrow, for example, where you can abstract away some of those Arrow nuances.
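For illustration, a minimal sketch of the Python `mapInArrow` call (assuming Spark 3.3+ with pyarrow installed):

```python
import pyarrow as pa
import pyarrow.compute as pc
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(0, 100)

def double_ids(batches):
    # The function receives an iterator of pyarrow.RecordBatch and yields
    # pyarrow.RecordBatch; data stays in Arrow format, no Pandas round-trip.
    for batch in batches:
        doubled = pc.multiply(batch.column(0), 2)
        yield pa.RecordBatch.from_arrays([doubled], names=["id"])

result = df.mapInArrow(double_ids, schema="id long")
result.show(5)
```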

Tagar
-1

Apache Arrow is a cross-language development platform for in-memory columnar data structures. Because it is cross-language, it has implementations in many programming languages, such as Python, Java, C, C++, C#, Go, R, Ruby, JavaScript, MATLAB, and Rust.

Since it supports Java, it also supports Scala, as both run on top of the JVM. But to convert Scala objects to Arrow objects, the common path goes through Python, because Arrow's Python bindings (pyarrow) are the most extensively supported.

Ultimately, Python talks with Scala over the JVM bridge, making the JVM-side data readily available to make use of.

Please go through the link below, where a detailed description is available: https://databricks.com/session/accelerating-tensorflow-with-apache-arrow-on-spark-bonus-making-it-available-in-scala