
I have read many blog posts and articles that quote "ORC works very well with Apache Hive, Parquet works extremely well with Apache Spark," but none of them give a proper, detailed explanation of why.

Could someone provide an example or explanation to justify this claim?

SNS
  • There are plenty of opinions and comparison posts. Have you happened to see this one, for example? https://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy – mazaneicha Aug 07 '20 at 21:18
  • Does this answer your question? [Parquet vs ORC vs ORC with Snappy](https://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy) – thebluephantom Aug 08 '20 at 10:02

1 Answer


Hive has a vectorized ORC reader but no vectorized Parquet reader, while Spark has a vectorized Parquet reader but no vectorized ORC reader. That is why Spark performs best with Parquet and Hive performs best with ORC.

Vectorization means that rows are decoded in batches, dramatically improving memory locality and cache utilization.
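
To make that concrete, here is a minimal conceptual sketch in plain Scala. It is not Spark or Hive internals; the function names and data layout are assumptions for illustration. It contrasts row-at-a-time decoding with batched, columnar decoding:

```scala
// Conceptual sketch only -- not Spark/Hive internals.

// Row-at-a-time: each value is reached through a per-row structure,
// so the hot loop chases references and misses cache frequently.
def sumRowAtATime(rows: Iterator[Array[Long]], colIndex: Int): Long =
  rows.foldLeft(0L)((acc, row) => acc + row(colIndex))

// Vectorized: the reader first decodes a whole column chunk (a batch)
// into a primitive array, then aggregates in a tight loop that stays
// in cache and that the JIT can unroll or auto-vectorize.
def sumVectorized(batches: Iterator[Array[Long]]): Long = {
  var total = 0L
  for (batch <- batches) {
    var i = 0
    while (i < batch.length) {
      total += batch(i)
      i += 1
    }
  }
  total
}
```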

[Update]

Spark 2.3 introduced a native vectorized ORC reader, which improves ORC read performance alongside the existing native Parquet reader.
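
As a quick sketch of how to exercise these readers: the config keys below are real Spark SQL options (spark.sql.orc.impl and spark.sql.orc.enableVectorizedReader arrived with Spark 2.3; the Parquet vectorized reader is on by default), while the app name and file paths are placeholders.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("orc-vs-parquet") // placeholder name
  // Use Spark's native ORC reader rather than the older Hive-based one.
  .config("spark.sql.orc.impl", "native")
  // Decode ORC in columnar batches (requires the native implementation).
  .config("spark.sql.orc.enableVectorizedReader", "true")
  // Parquet's vectorized reader is already enabled by default.
  .config("spark.sql.parquet.enableVectorizedReader", "true")
  .getOrCreate()

// Both formats now go through vectorized scan paths (placeholder paths).
val orcDf = spark.read.orc("/data/events_orc")
val parquetDf = spark.read.parquet("/data/events_parquet")
```

Flipping spark.sql.orc.impl back to hive, or the vectorized flags to false, is an easy way to benchmark the difference on your own data.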

Shrey Jakhmola