Spark create dataframe - from hive table or from parquet file

Asked Jun 13 '19 at 18:23

Active Jun 13 '19 at 20:27

Viewed 385 times

I have a scenario in which I must prepare multiple dataframes which will be used for joins.

These dataframes are formed by selecting a few columns from the source. The source files are Parquet, and there is an external Hive table defined over each Parquet file folder.

My question is: which of the two approaches below gives better performance?

Dataframe frame1 = spark.read.format("parquet").load(parquet-location).select(few columns here)

Dataframe frame2 = spark.sql("select <few columns here> from HIVEDB.Table_upon_parquet_files")

Which dataframe would build faster, frame1 or frame2? If one is better than the other, why? Please explain.
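For reference, here is a minimal sketch of how the two variants could be compared by printing their query plans. It is only an illustration under assumptions: the path `/path/to/parquet/folder` and the column names `col_a` and `col_b` are hypothetical placeholders, and the session is assumed to have Hive support available. Since Spark dataframes are evaluated lazily, creating either one does not read the data, so the plans (or a timed action) are what would actually need comparing.

```scala
import org.apache.spark.sql.SparkSession

object CompareReads {
  def main(args: Array[String]): Unit = {
    // Hive support is needed so that spark.sql can resolve the external table.
    val spark = SparkSession.builder()
      .appName("compare-parquet-vs-hive")
      .enableHiveSupport()
      .getOrCreate()

    // Hypothetical location and column names, used only for illustration.
    val parquetLocation = "/path/to/parquet/folder"

    // Variant 1: read the Parquet files directly and select a few columns.
    val frame1 = spark.read
      .format("parquet")
      .load(parquetLocation)
      .select("col_a", "col_b")

    // Variant 2: select the same columns through the external Hive table.
    val frame2 = spark.sql(
      "SELECT col_a, col_b FROM HIVEDB.Table_upon_parquet_files")

    // Print the parsed, analyzed, optimized, and physical plans for each variant.
    frame1.explain(true)
    frame2.explain(true)

    spark.stop()
  }
}
```

Comparing the two physical plans side by side should show whether both variants end up as the same kind of Parquet scan with the same selected columns.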

edited Jun 13 '19 at 20:27

thebluephantom


asked Jun 13 '19 at 18:23

Lokesh Raju

Why not try and see? – thebluephantom Jun 13 '19 at 20:27
S3 or other, or HDFS? – thebluephantom Jun 13 '19 at 20:31
Possible duplicate of [Is it better for Spark to select from hive or select from file](https://stackoverflow.com/questions/44120162/is-it-better-for-spark-to-select-from-hive-or-select-from-file) – thebluephantom Jun 13 '19 at 20:34
