how to avoid duplicate entries in hive?

Question

Create a table with primary key in Hive. Insert the identical data record several times.

How can you avoid that the data record (primary key) is not inserted more than once without using a second temporary table?

drop table t1;

CREATE TABLE IF NOT EXISTS `t1` (
`ID`        BIGINT DEFAULT SURROGATE_KEY(),
`Name`      STRING NOT NULL DISABLE NOVALIDATE,
CONSTRAINT `PK_t1` PRIMARY KEY (`ID`) DISABLE NOVALIDATE);

select * from t1;
+--------+----------+
| t1.id  | t1.name  |
+--------+----------+
+--------+----------+

insert into t1 values (1, "Hi");
insert into t1 values (1, "Hi");
insert into t1 values (1, "Hi");

select * from t1;
+--------+----------+
| t1.id  | t1.name  |
+--------+----------+
| 1      | Hi       |
| 1      | Hi       |
| 1      | Hi       |
+--------+----------+

I tried unsuccessfully with a merge: MERGE INTO t1 USING (select * from t1) sub ON sub.id != t1.id
WHEN not matched then insert values (2, "World");

Using UNION ALL + row_number or using FULL JOIN: https://stackoverflow.com/a/44755825/2700344 — leftjoin, Mar 30 '20 at 12:58
Not clear how it should be implemented because the link uses 2 tables. I have a (partial) solution: MERGE INTO t1 USING (SELECT 1 AS id, "hi" AS name) sub ON sub.id = t1.id WHEN not matched then insert values (1, "hi"); The disadvantage is that the data has to be entered twice. Can you avoid that? — debo, Mar 30 '20 at 14:52
Yes, two tables. One table (new data) can be subquery or CTE — leftjoin, Mar 30 '20 at 14:58

how to avoid duplicate entries in hive?

0 Answers0