I have little experience in Hive and currently learning Spark with Scala. I am curious to know whether Hive on Tez really faster than SparkSQL. I searched many forums with test results but they have compared older version of Spark and most of them are written in 2015. Summarized main points below
- ORC will do the same as parquet in Spark
- Tez engine will give better performance like Spark engine
- Joins are better/faster in Hive than Spark
I feel like Hortonworks supports more for Hive than Spark and Cloudera vice versa.
sample links :
Initially I thought Spark would be faster than anything because of their in-memory execution. after reading some articles I got Somehow existing Hive also getting improvised with new concepts like Tez, ORC, LLAP etc.
Currently running with PL/SQL Oracle and migrating to big data since volumes are getting increased. My requirements are kind of ETL batch processing and included data details involved in every weekly batch runs. Data will increase widely soon.
Input/lookup data are csv/text formats and updating into tables
Two input tables which has 5 million rows and 30 columns
- 30 look up tables used to generate each column of output table which contains around 10 million rows and 220 columns.
- Multiple joins involved like inner and left outer since many look up tables used.
Kindly please advise which one of below method I should choose for better performance with readability and easy to include minor updates on columns for future production deployment.
Method 1:
- Hive on Tez with ORC tables
- Python UDF thru TRANSFORM option
- Joins with performance tuning like map join
Method 2:
- SparkSQL with Parquet format which is converting from text/csv
- Scala for UDF
- Hope we can perform multiple inner and left outer join in Spark