I have a table called zhihu_answer
. I take it as the warehouse to store data scraped day by day. Every day there will create a zhihu_answer_tmp
table storing the newly scraped data, which shared the same data stucture as zhihu_answer
.
The DDL is:
createtab_stmt
CREATE TABLE `zhihu_answer`(
`admin_closed_comment` boolean,
`answer_content` string,
`answer_created` string,
`answer_id` string,
`insert_time` string,
`voteup_count` int)
PARTITIONED BY (`year_month` string)
I took answer_id
and insert_time
as the unique key, my question is how to merge the new data in zhihu_answer_tmp
into the history data table zhihu_answer
based on answer_id
and insert_time
?
In specific, if there is a row with the same answer_id and insert_time
existed in zhihu_answer
, then do nothing, simply ignore(for Idempotency, prevent side-effect from inserting data more than twice).
On the other hand, if there is not any rows with the same answer_id and insert_time
as zhihu_answer_tmp
in zhihu_answer
, then insert the rows(newly scraped data).
Thanks for any advice or any solution the question.