0

I have a table called zhihu_answer. I take it as the warehouse to store data scraped day by day. Every day there will create a zhihu_answer_tmp table storing the newly scraped data, which shared the same data stucture as zhihu_answer.

The DDL is:

createtab_stmt  
CREATE TABLE `zhihu_answer`(    
  `admin_closed_comment` boolean,   
  `answer_content` string,  
  `answer_created` string,  
  `answer_id` string,   
  `insert_time` string,         
  `voteup_count` int)   
PARTITIONED BY (`year_month` string)    

I took answer_id and insert_time as the unique key, my question is how to merge the new data in zhihu_answer_tmp into the history data table zhihu_answer based on answer_id and insert_time?

In specific, if there is a row with the same answer_id and insert_time existed in zhihu_answer, then do nothing, simply ignore(for Idempotency, prevent side-effect from inserting data more than twice).

On the other hand, if there is not any rows with the same answer_id and insert_time as zhihu_answer_tmp in zhihu_answer, then insert the rows(newly scraped data).

Thanks for any advice or any solution the question.

DennisLi
  • 3,915
  • 6
  • 30
  • 66
  • 1
    you can use full outer join with case statements to complete the task.. https://stackoverflow.com/questions/56939313/optimize-the-join-performance-with-hive-partition-table – vikrant rana Sep 29 '19 at 06:56
  • 1
    Possible duplicate of [Hive: Best way to do incremetal updates on a main table](https://stackoverflow.com/questions/37709411/hive-best-way-to-do-incremetal-updates-on-a-main-table) – leftjoin Sep 29 '19 at 08:15
  • thanks for your advice, I take it for reference. – DennisLi Sep 30 '19 at 08:05

0 Answers0