0

I need to update a Table Hive like

update A from B
    set 
        Col5 = A.Col2, 
        Col2 =   B.Col2, 
        DT_Change = B.DT, 
        Col3 = B.Col3, 
        Col4 = B.Col4
where A.Col1 = B.Col1  and  A.Col2 <> B.Col2

Using Scala Spark RDD

How can I do this ?

Keshav Pradeep Ramanath
  • 1,623
  • 4
  • 24
  • 33
W.R
  • 11
  • 1
  • 2
  • Welcome to Stack Overflow! Please read the guiedlines for asking a good question: https://stackoverflow.com/help/how-to-ask – TrevorBrooks Aug 29 '17 at 16:37

1 Answers1

1

I want to split this question in to two questions to explain it simple. First question : How to write Spark RDD data to Hive table ? The simplest way is to convert the RDD in to Spark SQL (dataframe) using method rdd.toDF(). Then register the dataframe as temptable using df.registerTempTable("temp_table"). Now you can query from the temptable and insert in to hive table using sqlContext.sql("insert into table my_table select * from temp_table"). Second question: How to update Hive table from Spark ? As of now, Hive is not a best fit for record level updates. Updates can only be performed on tables that support ACID. One primary limitation is only ORC format supports updating Hive tables. You can find some information on it from https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions You can refer How to Updata an ORC Hive table form Spark using Scala for this.

Few methods might have deprecated with spark 2.x and you can check spark 2.0 documentation for the latest methods. While there could be better approaches, this is the simplest approach that I can think of which works.

Satya
  • 41
  • 5
  • My question is rather how to load the 2 tables in 2 different rdd and do the update on one of the rdd then save the result in the hive table – W.R Aug 29 '17 at 19:41
  • @W.R As the data in tables is already structured, using dataframes in the place of rdds is the right approach. In this case, load the data from table 1 in to dataframe1 and table 2 in to dataframe 2. Now you can perform transformations or joins on these dataframes. For example, df1.join( df2, col("df1.col1") === col("df2.col1"), "left_outer") – Satya Aug 29 '17 at 21:39