
I'm looking into a way to speed up my memory-intensive frontend vis app. I saw some people recommend Apache Arrow; while I'm looking into it, I'm confused about the difference between Parquet and Arrow.

They are both columnar data formats. Originally I thought Parquet was for disk and Arrow was the in-memory format. However, I just learned that you can save Arrow to files on disk as well, like abc.arrow. In that case, what's the difference? Aren't they doing the same thing?

Audrey

1 Answer


Parquet is a columnar file format for data serialization. Reading a Parquet file requires decompressing and decoding its contents into some kind of in-memory data structure. It is designed to be space/IO-efficient at the expense of CPU utilization for decoding. It does not provide any data structures for in-memory computing. Parquet is a streaming format which must be decoded from start to end; while some "index page" facilities have been added to the storage format recently, in general random access operations are costly.
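For illustration, here is a minimal sketch using the pyarrow Python library (the file name and data are made up; any full Parquet implementation behaves similarly):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Build a small in-memory table, then serialize it to Parquet.
table = pa.table({
    "id": [1, 2, 3, 4],
    "value": [0.1, 0.2, 0.3, 0.4],
})

# Writing encodes and compresses the columns; this is where Parquet
# trades CPU time for space/IO efficiency.
pq.write_table(table, "example.parquet", compression="snappy")
```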

Arrow, on the other hand, is first and foremost a library providing columnar data structures for in-memory computing. When you read a Parquet file, you can decompress and decode the data into Arrow columnar data structures, so that you can then perform analytics in-memory on the decoded data. The Arrow columnar format has some nice properties: random access is O(1), and each value cell is next to the previous and following one in memory, so it's efficient to iterate over.
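Continuing the sketch above (still pyarrow; the column name comes from the made-up example):

```python
import pyarrow.parquet as pq

# Decompress and decode the Parquet file into Arrow columnar arrays.
table = pq.read_table("example.parquet")

# Once in Arrow format, any cell can be reached in O(1)...
print(table.column("value")[2])  # -> 0.3

# ...and iterating a column walks contiguous memory.
total = sum(v.as_py() for v in table.column("value"))
```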

What about "Arrow files" then? Apache Arrow defines a binary "serialization" protocol for arranging a collection of Arrow columnar arrays (called a "record batch") that can be used for messaging and interprocess communication. You can put the protocol anywhere, including on disk, which can later be memory-mapped or read into memory and sent elsewhere.

This Arrow protocol is designed so that you can "map" a blob of Arrow data without doing any deserialization, so performing analytics on Arrow protocol data on disk can use memory-mapping and pay effectively zero cost. The protocol is used for many things, such as streaming data between Spark SQL and Python for running pandas functions against chunks of Spark SQL data; these are called "pandas UDFs".
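A sketch of the memory-mapping path, reading back the file written above:

```python
import pyarrow as pa

# Memory-map the file: no copy and no deserialization of the columnar data.
with pa.memory_map("example.arrow", "r") as source:
    table = pa.ipc.open_file(source).read_all()

# The arrays reference the mapped pages directly; the OS faults pages in
# only as the data is actually touched.
print(table.num_rows)
```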

In some applications, Parquet and Arrow can be used interchangeably for on-disk data serialization. Some things to keep in mind:

  • Parquet is intended for "archival" purposes, meaning if you write a file today, we expect that any system that says it can "read Parquet" will be able to read the file in 5 or 7 years. We are not yet making this assertion about long-term stability of the Arrow format (though we might in the future).
  • Parquet is generally a lot more expensive to read because it must be decoded into some other data structure. Arrow protocol data can simply be memory-mapped.
  • Parquet files are often much smaller than Arrow-protocol-on-disk because of the data encoding schemes that Parquet uses. If your disk storage or network is slow, Parquet is going to be a better choice (a quick size-comparison sketch follows below).
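To see the size difference on your own data, a rough sketch (exact numbers depend entirely on the data and codec, so none are claimed here):

```python
import os
import pyarrow as pa
import pyarrow.parquet as pq

# A repetitive column, where Parquet's encodings (dictionary, RLE) shine.
table = pa.table({"category": ["a", "b"] * 500_000})

pq.write_table(table, "data.parquet")
with pa.OSFile("data.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

print("parquet bytes:", os.path.getsize("data.parquet"))
print("arrow bytes:  ", os.path.getsize("data.arrow"))
```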

So, in summary, Parquet files are designed for disk storage, while Arrow is designed for in-memory use (but you can put it on disk, then memory-map it later). They are intended to be compatible with each other and used together in applications.

For a memory-intensive frontend app, I might suggest looking at the Arrow JavaScript (TypeScript) library.

Wes McKinney
  • @WesMcKinney, does "in-memory" mean there should be some kind of platform keeping this Arrow data in memory, such as an in-memory computing platform running in a large cluster (e.g. something like Ignite)? – Ashika Umanga Umagiliya Dec 16 '19 at 05:36
  • @WesMcKinney thank you for the great explanation. It was interesting to read and very useful – VB_ May 11 '20 at 13:41
  • Do you have any tutorials or examples on in-memory access on an Arrow table? – SnG Oct 23 '20 at 19:43
  • As of 1.0.0 (July 2020) Arrow is backwards compatible. See here for info: https://arrow.apache.org/blog/2020/07/24/1.0.0-release/ – johnml1135 Nov 05 '20 at 14:59
  • Arrow has a FAQ [page](https://arrow.apache.org/faq/) that is more up-to-date. And as of Jul 11, 2021, the FAQ suggests Parquet is still the choice (over Arrow) for long term storage -- `While the Arrow on-disk format is stable and will be readable by future versions of the libraries, it does not prioritize the requirements of long-term archival storage`. – HCSF Jul 11 '21 at 05:17
  • Why is Parquet mentioned as a streaming format? What does this mean? I assumed that we have to parse the entire file to be able to read the data. – ns15 Jul 11 '21 at 09:40
  • "Parquet is a columnar file format" <- Is there a reliable reference for this? I see many times parquet referenced as "hybrid" (something between columnar and row-wise formats) – Niko Föhr May 09 '22 at 12:52
  • See, for example [The Parquet Format and Performance Optimization Opportunities Boudewijn Braams (Databricks)](https://youtu.be/1j8SdS7s_NY?t=615) [10:15-10:45] – Niko Föhr May 09 '22 at 13:14
  • @np8 It's hybrid columnar. – ns15 Oct 24 '22 at 13:37
  • Is it possible to query/read a large Parquet or Arrow file directly from S3 without having to download it entirely (e.g. using byte ranges)? – collimarco Mar 30 '23 at 08:23
  • Yes, use [fsspec](https://filesystem-spec.readthedocs.io/en/latest/features.html#pyarrow-integration) by Martin Durant. It is supported by pyarrow and e.g. [dask](https://docs.dask.org/en/stable/how-to/connect-to-remote-data.html) to do powerful things when your storage is remote. – hard Mar 31 '23 at 10:47
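A minimal sketch of what the last two comments describe, using pyarrow with an fsspec filesystem (s3fs); the bucket path and column name are placeholders:

```python
import pyarrow.parquet as pq
import s3fs  # an fsspec filesystem implementation for S3

fs = s3fs.S3FileSystem(anon=True)

# fsspec file objects support seek/read, so pyarrow fetches byte ranges
# on demand instead of downloading the whole object.
with fs.open("some-bucket/path/data.parquet") as f:  # hypothetical path
    pf = pq.ParquetFile(f)
    # Reads only the footer plus the pages for this column and row group.
    subset = pf.read_row_group(0, columns=["value"])
```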