Hudi does not seem to deduplicate records in some cases. Below is the configuration we use. We partition the data by customer_id, so our expectation is that Hudi will enforce uniqueness within each partition, i.e., within each customer_id folder. However, we are noticing that some customer_id folders contain two parquet files, and when we query the data in these partitions we see duplicate unique_user_id values within the same customer_id. The _hoodie_record_key is identical for the two duplicate records, but the _hoodie_file_name is different, which makes me suspect that Hudi is enforcing uniqueness not within the customer_id folder, but within each individual parquet file. Can someone explain this behavior?
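For reference, a query along these lines is what surfaces the duplicates (database_name and table_name are the Hive-sync names from the config below; substitute the real ones):

-- duplicate record keys within a customer_id partition,
-- counting how many distinct parquet files they appear in
SELECT customer_id,
       _hoodie_record_key,
       COUNT(*) AS copies,
       COUNT(DISTINCT _hoodie_file_name) AS files
FROM database_name.table_name
GROUP BY customer_id, _hoodie_record_key
HAVING COUNT(*) > 1;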
op: "INSERT"
target-base-path: "s3_path"
target-table: "some_table_name"
source-ordering-field: "created_at"
transformer-class: "org.apache.hudi.utilities.transform.SqlQueryBasedTransformer"
filter-dupes: ""
hoodie_conf:
  # source table base path
  hoodie.deltastreamer.source.dfs.root: "s3_path"
  # record key, partition path and key generator
  hoodie.datasource.write.recordkey.field: "user_id,customer_id"
  hoodie.datasource.write.partitionpath.field: "customer_id"
  hoodie.datasource.write.keygenerator.class: "org.apache.hudi.keygen.ComplexKeyGenerator"
  # hive sync properties
  hoodie.datasource.hive_sync.enable: true
  hoodie.datasource.hive_sync.table: "table_name"
  hoodie.datasource.hive_sync.database: "database_name"
  hoodie.datasource.hive_sync.partition_fields: "customer_id"
  hoodie.datasource.hive_sync.partition_extractor_class: "org.apache.hudi.hive.MultiPartKeysValueExtractor"
  hoodie.datasource.write.hive_style_partitioning: true
  # sql transformer
  hoodie.deltastreamer.transformer.sql: "SELECT user_id, customer_id, updated_at as created_at FROM <SRC> a"
  # since there is no dt partition, the following default config has to be overridden
  hoodie.deltastreamer.source.dfs.datepartitioned.selector.depth: 0
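For completeness, a query like the one below (again with the Hive-sync names, and a placeholder for one affected customer_id) shows the two copies side by side: the same _hoodie_record_key appears with two different _hoodie_file_name values within one partition.

-- inspect the duplicate rows inside a single customer_id partition
-- '<some_customer_id>' is a placeholder for one of the affected partitions
SELECT _hoodie_record_key,
       _hoodie_file_name,
       _hoodie_commit_time,
       user_id,
       customer_id
FROM database_name.table_name
WHERE customer_id = '<some_customer_id>'
ORDER BY _hoodie_record_key;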