1

I am creating a hive table with a large data set, Is there way of creating constraints on the table so that no two rows are the same when we insert the data.

vicky
  • 13
  • 2
  • 1
    Welcome to StackOverflow! To get accurate answers, you should write what is your target and what problem is making you stuck, attach the portion of code interested and also explain what you've tried to solve the issues – xKobalt Jul 24 '20 at 07:53
  • Hi @vicky, If you are feeding the table from a `Spark` job to `Hive` table, you could `drop duplicates` before writing into the table. `Hive` is ` schema on read` so you cannot put `constraints` on it as in traditional databases like `Oracle` or `MySQL` that are `schema on write`. ` schema on read` allows unstructured data to be stored in the database. – Chema Jul 24 '20 at 10:53

1 Answers1

0

Hive does not provide validated UNIQUE, PRIMARY KEY constraints.

As of 2.1.0 Hive includes support for non-validated primary and foreign key constraints. Since these constraints are not validated, an upstream system needs to ensure data integrity before it is loaded into Hive. And as of 3.0.0 Hive includes support for UNIQUE, NOT NULL, DEFAULT and CHECK constraints. Beside UNIQUE all three type of constraints are enforced.

You can apply DISTINCT or ROW_NUMBER, to all the dataset or partition. Also you can use UNION old data with new data for simply removing duplicates. If your table is partitioned, you can rewrite partition in such way:

insert overwrite table MYTABLE partition(load_date='2020-07-25')
select col1, col2, ... colN
from MYTABLE where load_date='2020-07-25'
UNION
select  col1, col2, ... colN
from DAILY_INCREMENT_DATA

UNION will return distinct rows.

See also this answer for more details about using row_number and other loading scenarios.

Also Hive 2.2 supports MERGE in ACID mode.

leftjoin
  • 36,950
  • 8
  • 57
  • 116