
I have read many blog posts and articles that quote "ORC works very well with Apache Hive, Parquet works extremely well with Apache Spark," but none of them give a proper, detailed explanation of why.

Could someone provide an example or explanation to justify this claim?

SNS
  • There are plenty of opinions and comparison posts. Have you happened to see this one, for example? https://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy – mazaneicha Aug 07 '20 at 21:18
  • Does this answer your question? [Parquet vs ORC vs ORC with Snappy](https://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy) – thebluephantom Aug 08 '20 at 10:02

1 Answer


Hive has a vectorized ORC reader but no vectorized Parquet reader, while Spark has a vectorized Parquet reader but no vectorized ORC reader. That is why Spark performs best with Parquet and Hive performs best with ORC.

Vectorization means that rows are decoded in batches, dramatically improving memory locality and cache utilization.
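
To make that concrete, here is a minimal conceptual sketch in plain Scala. It is not Spark or Hive internals; the function names and data layout are assumptions for illustration. It contrasts row-at-a-time decoding with batched, columnar decoding:

```scala
// Conceptual sketch only -- not Spark/Hive internals.

// Row-at-a-time: each value is reached through a per-row structure,
// so the hot loop chases references and misses cache frequently.
def sumRowAtATime(rows: Iterator[Array[Long]], colIndex: Int): Long =
  rows.foldLeft(0L)((acc, row) => acc + row(colIndex))

// Vectorized: the reader first decodes a whole column chunk (a batch)
// into a primitive array, then aggregates in a tight loop that stays
// in cache and that the JIT can unroll or auto-vectorize.
def sumVectorized(batches: Iterator[Array[Long]]): Long = {
  var total = 0L
  for (batch <- batches) {
    var i = 0
    while (i < batch.length) {
      total += batch(i)
      i += 1
    }
  }
  total
}
```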

[Update]

Spark 2.3 introduced a native vectorized ORC reader, which improves ORC read performance alongside the existing native Parquet reader.
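
As a quick sketch of how to exercise these readers: the config keys below are real Spark SQL options (spark.sql.orc.impl and spark.sql.orc.enableVectorizedReader arrived with Spark 2.3; the Parquet vectorized reader is on by default), while the app name and file paths are placeholders.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("orc-vs-parquet") // placeholder name
  // Use Spark's native ORC reader rather than the older Hive-based one.
  .config("spark.sql.orc.impl", "native")
  // Decode ORC in columnar batches (requires the native implementation).
  .config("spark.sql.orc.enableVectorizedReader", "true")
  // Parquet's vectorized reader is already enabled by default.
  .config("spark.sql.parquet.enableVectorizedReader", "true")
  .getOrCreate()

// Both formats now go through vectorized scan paths (placeholder paths).
val orcDf = spark.read.orc("/data/events_orc")
val parquetDf = spark.read.parquet("/data/events_parquet")
```

Flipping spark.sql.orc.impl back to hive, or the vectorized flags to false, is an easy way to benchmark the difference on your own data.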

Shrey Jakhmola