
I was just wondering what people's thoughts were on reading from Hive vs reading from a .csv file, a .txt file, an .orc file, or a .parquet file. Assuming the underlying Hive table is an external table with the same file format, would you rather read from the Hive table or from the underlying file itself, and why?

Mike

uh_big_mike_boi
  • You may find the most recent answer to https://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy quite interesting >> vectorization is a real bonus... As well as "predicate push-down", "skip scans" based on local stats (both features require columnar formats), efficient partition pruning, appropriate compression... – Samson Scharfrichter May 22 '17 at 20:39
  • Good question Big Mike – thebluephantom Jun 13 '19 at 20:35

2 Answers


tl;dr: I would read it straight from the Parquet files.

I am using Spark 1.5.2 and Hive 1.2.1. For a 5-million-row x 100-column table, some timings I've recorded are:

val dffile = sqlContext.read.parquet("/path/to/parquets/*.parquet")
val dfhive = sqlContext.table("db.table")

dffile count --> 0.38s; dfhive count --> 8.99s

dffile sum(col) --> 0.98s; dfhive sum(col) --> 8.10s

dffile substring(col) --> 2.63s; dfhive substring(col) --> 7.77s

dffile where(col=value) --> 82.59s; dfhive where(col=value) --> 157.64s

Note that these were done with an older version of Hive and an older version of Spark, so I can't comment on what speed improvements newer releases may have brought to either reading mechanism.
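
For anyone who wants to reproduce a comparison like this, here is a minimal sketch of how such timings could be collected. The paths, the table name, and the timed helper are illustrative only (not from the original benchmark), and it assumes a Spark 1.x spark-shell where sc is already defined:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)                                  // sc comes from spark-shell
val dffile = sqlContext.read.parquet("/path/to/parquets/*.parquet")  // placeholder path
val dfhive = sqlContext.table("db.table")                            // placeholder table name

// small helper that prints the wall-clock time of a Spark action
def timed[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(s"$label: ${(System.nanoTime() - start) / 1e9} s")
  result
}

timed("dffile count")(dffile.count())
timed("dfhive count")(dfhive.count())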

James Tobin

From what I understand, even though in general ORC is better suited for flat structures and Parquet for nested ones, Spark is optimised towards Parquet. Therefore, it is advised to use that format with Spark.
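
As a rough illustration of that advice (assuming a Spark 2.x session named spark; the table name, partition column and target path are placeholders), writing a DataFrame out as Parquet looks like this:

val df = spark.table("db.some_table")             // placeholder source table
df.write
  .mode("overwrite")
  .partitionBy("event_date")                      // placeholder partition column
  .parquet("/warehouse/some_table_parquet")       // placeholder target path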

Furthermore, the metadata for any tables you read from Parquet will be stored in Hive anyway. From the Spark docs: "Spark SQL caches Parquet metadata for better performance. When Hive metastore Parquet table conversion is enabled, metadata of those converted tables are also cached. If these tables are updated by Hive or other external tools, you need to refresh them manually to ensure consistent metadata."
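
A minimal sketch of that manual refresh (Spark 2.x API; "db.table" is a placeholder):

// invalidate Spark's cached Parquet metadata after an external update
spark.catalog.refreshTable("db.table")
// or, equivalently, through SQL
spark.sql("REFRESH TABLE db.table")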

I tend to transform data as soon as possible into Parquet format and store it in Alluxio, backed by HDFS. This allows me to achieve better performance for read/write operations and limit the use of cache.
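
Purely for illustration, and continuing with the placeholder df and spark session from the sketch above, the write/read round trip against Alluxio would look like this; the master host, port and warehouse path are placeholders, and the Alluxio client jar has to be on Spark's classpath for the alluxio:// scheme to resolve:

df.write
  .mode("overwrite")
  .parquet("alluxio://alluxio-master:19998/warehouse/events")           // placeholder URI

val fromAlluxio = spark.read.parquet("alluxio://alluxio-master:19998/warehouse/events")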

I hope it helps.

elcomendante
  • That's cool that you are using Alluxio! But my question is specifically about Hive vs Non-Hive. Whether it's ORC, Parquet, whatever, that could be defined by something else. It might be a requirement to have ORC. I am just asking about Hive table vs just the file. – uh_big_mike_boi May 22 '17 at 20:02
  • Even though I think that Spark is still better with parquet, it has to be mentioned that Spark continuously works on its ORC-support which is available since Spark 2.3 (https://spark.apache.org/docs/latest/sql-data-sources-orc.html). So this might change in the future. – Markus Jun 07 '19 at 12:19