I have this Hive MERGE statement:
MERGE INTO destination dst
USING (
    SELECT
        -- DISTINCT fields
        company
        , contact_id AS id
        , ct.cid AS cid
        -- other fields
        , email
        , timestamp_utc
        -- there are actually about 6 more
        -- deduplication
        , ROW_NUMBER() OVER (
            PARTITION BY company
                , ct.cid
                , contact_id
            ORDER BY timestamp_utc DESC
        ) AS r
    FROM
        source
        LATERAL VIEW explode(campaign_id) ct AS cid
) src
ON
    dst.company = src.company
    AND dst.campaign_id = src.cid
    AND dst.id = src.id
-- On match: keep latest loaded
WHEN MATCHED
    AND dst.updated_on_utc < src.timestamp_utc
    AND src.r = 1
THEN UPDATE SET
    email = src.email
    , updated_on_utc = src.timestamp_utc
WHEN NOT MATCHED AND src.r = 1 THEN INSERT VALUES (
    src.id
    , src.email
    , src.timestamp_utc
    , src.license_name
    , src.cid
);
This runs for a very long time (about 30 minutes for 7 GB of compressed Avro data on disk), and I wonder whether there are any SQL-level ways to improve it.
ROW_NUMBER() is there to deduplicate the source table, so that the WHEN MATCHED / WHEN NOT MATCHED clauses only act on the latest row for each key.
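In isolation, that deduplication is just the usual ROW_NUMBER() pattern over the exploded source; on its own, with the filter applied directly, it would look roughly like this (sketch only, same column names as above):

    SELECT *
    FROM (
        SELECT
            company
            , contact_id AS id
            , ct.cid AS cid
            , email
            , timestamp_utc
            -- keep only the most recent row per (company, campaign, contact)
            , ROW_NUMBER() OVER (
                PARTITION BY company, ct.cid, contact_id
                ORDER BY timestamp_utc DESC
            ) AS r
        FROM source
        LATERAL VIEW explode(campaign_id) ct AS cid
    ) t
    WHERE r = 1;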
One thing I am not sure of is this warning from Hive:
SQL Standard requires that an error is raised if the ON clause is such that more than 1 row in source matches a row in target. This check is computationally expensive and may affect the overall runtime of a MERGE statement significantly. hive.merge.cardinality.check=false may be used to disable the check at your own risk. If the check is disabled, but the statement has such a cross join effect, it may lead to data corruption.
I do indeed disable the cardinality check: although the ON clause can match more than one source row per target row, those rows are reduced to a single one by the src.r = 1 condition in the WHEN clauses.
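For completeness, that is done before the statement with:

    SET hive.merge.cardinality.check=false;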
Overall I like this MERGE statement, but it is just too slow, and any help would be appreciated.
Note that the destination table is partitioned. The source table is not, as it is an external table that must be fully merged (and therefore fully scanned) on every run; in the background, already-merged data files are removed and new files are added before the next run. I am not sure partitioning the source would help in that case.
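For context, the destination has to be a transactional ORC table for MERGE to work; simplified, it looks something like this (the column list is abbreviated, and the partition column and bucketing shown here are only placeholders):

    CREATE TABLE destination (
        id STRING
        , email STRING
        , updated_on_utc TIMESTAMP
        , license_name STRING
        , campaign_id STRING
        -- plus the other fields
    )
    PARTITIONED BY (company STRING)     -- placeholder partition column
    CLUSTERED BY (id) INTO 32 BUCKETS   -- placeholder bucketing
    STORED AS ORC
    TBLPROPERTIES ('transactional' = 'true');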
What I have done:
- played with HDFS/Hive/YARN configuration;
- tried a temporary table (two steps) instead of a single MERGE; the run time jumped to more than 2 hours (a sketch of that attempt is below).
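The two-step variant was roughly along these lines (simplified sketch; the temporary table name and storage format are just what I used for the test):

    -- Step 1: materialise the deduplicated, exploded source
    CREATE TEMPORARY TABLE source_dedup STORED AS ORC AS
    SELECT company, id, cid, email, timestamp_utc  -- plus the other fields (license_name, etc.)
    FROM (
        SELECT
            company
            , contact_id AS id
            , ct.cid AS cid
            , email
            , timestamp_utc
            , ROW_NUMBER() OVER (
                PARTITION BY company, ct.cid, contact_id
                ORDER BY timestamp_utc DESC
            ) AS r
        FROM source
        LATERAL VIEW explode(campaign_id) ct AS cid
    ) t
    WHERE r = 1;

    -- Step 2: merge from the pre-deduplicated table
    MERGE INTO destination dst
    USING source_dedup src
    ON
        dst.company = src.company
        AND dst.campaign_id = src.cid
        AND dst.id = src.id
    WHEN MATCHED AND dst.updated_on_utc < src.timestamp_utc THEN UPDATE SET
        email = src.email
        , updated_on_utc = src.timestamp_utc
    WHEN NOT MATCHED THEN INSERT VALUES (
        src.id
        , src.email
        , src.timestamp_utc
        , src.license_name
        , src.cid
    );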