Hudi does not seem to deduplicate records in some cases. Below is the configuration we use. We partition the data by customer_id, so our expectation is that Hudi will enforce uniqueness within each partition, i.e., within each customer_id folder. However, we are noticing that some customer_id folders contain two parquet files, and when we query the data in these partitions we see duplicate unique_user_id values within the same customer_id. The _hoodie_record_key is identical for the two duplicate records, but the _hoodie_file_name is different, which makes me suspect that Hudi is enforcing uniqueness not within the customer_id folder, but within each individual parquet file. Can someone explain this behavior?
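For reference, a query along these lines is what surfaces the duplicates (database_name and table_name are the Hive-sync names from the config below; substitute the real ones):

-- duplicate record keys within a customer_id partition,
-- counting how many distinct parquet files they appear in
SELECT customer_id,
       _hoodie_record_key,
       COUNT(*) AS copies,
       COUNT(DISTINCT _hoodie_file_name) AS files
FROM database_name.table_name
GROUP BY customer_id, _hoodie_record_key
HAVING COUNT(*) > 1;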
op: "INSERT"
target-base-path: "s3_path"
target-table: "some_table_name"
source-ordering-field: "created_at"
transformer-class: "org.apache.hudi.utilities.transform.SqlQueryBasedTransformer"
filter-dupes: ""
hoodie_conf:
  # source table base path
  hoodie.deltastreamer.source.dfs.root: "s3_path"
  # record key, partition path and key generator
  hoodie.datasource.write.recordkey.field: "user_id,customer_id"
  hoodie.datasource.write.partitionpath.field: "customer_id"
  hoodie.datasource.write.keygenerator.class: "org.apache.hudi.keygen.ComplexKeyGenerator"
  # hive sync properties
  hoodie.datasource.hive_sync.enable: true
  hoodie.datasource.hive_sync.table: "table_name"
  hoodie.datasource.hive_sync.database: "database_name"
  hoodie.datasource.hive_sync.partition_fields: "customer_id"
  hoodie.datasource.hive_sync.partition_extractor_class: "org.apache.hudi.hive.MultiPartKeysValueExtractor"
  hoodie.datasource.write.hive_style_partitioning: true
  # sql transformer
  hoodie.deltastreamer.transformer.sql: "SELECT user_id, customer_id, updated_at as created_at FROM <SRC> a"
  # since there is no dt partition, the following default config has to be overridden
  hoodie.deltastreamer.source.dfs.datepartitioned.selector.depth: 0
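For completeness, a query like the one below (again with the Hive-sync names, and a placeholder for one affected customer_id) shows the two copies side by side: the same _hoodie_record_key appears with two different _hoodie_file_name values within one partition.

-- inspect the duplicate rows inside a single customer_id partition
-- '<some_customer_id>' is a placeholder for one of the affected partitions
SELECT _hoodie_record_key,
       _hoodie_file_name,
       _hoodie_commit_time,
       user_id,
       customer_id
FROM database_name.table_name
WHERE customer_id = '<some_customer_id>'
ORDER BY _hoodie_record_key;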