Going through the Hudi documentation, I saw the Metadata Config section and was curious about how it is used. I created a table with metadata enabled, and a directory was created under /.hoodie/metadata. Has anybody experimented with this feature? Is the metadata exposed, or is it only used internally by Hudi? What is it used for? I couldn't understand it from the docs.

I used the following Hudi options to create a table in S3 using PySpark.

hudi_options_insert = {
    "hoodie.table.name": "table_p5",
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.operation": "bulk_insert",
    "hoodie.datasource.write.partitionpath.field": "ds",
    "hoodie.datasource.write.precombine.field": "id",
    "hoodie.datasource.write.hive_style_partitioning": "true",
    "hoodie.datasource.hive_sync.table": "table_p5",
    "hoodie.datasource.hive_sync.database": "poc_hudi",
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.partition_fields": "ds",
    "hoodie.insert.shuffle.parallelism": 6,
    "hoodie.metadata.enable": "true",
    "hoodie.metadata.insert.parallelism": 6
}
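
For context, here is a minimal sketch of how an options dict like the one above is usually handed to the Spark writer. The helper name `write_hudi_table`, the default save mode, and the example path in the comment are assumptions for illustration; they are not taken from the original job.

```python
def write_hudi_table(df, base_path, options, mode="append"):
    """Write a Spark DataFrame to base_path as a Hudi table.

    `df` is expected to be a pyspark.sql.DataFrame, and the Spark session
    must have the Hudi bundle on its classpath (assumption).
    """
    # Cast every value to a string so integer settings such as the
    # parallelism values are passed in the same form as the other options.
    str_options = {key: str(value) for key, value in options.items()}

    (
        df.write.format("hudi")
        .options(**str_options)
        .mode(mode)
        .save(base_path)
    )
    return str_options


# Hypothetical usage (bucket name is a placeholder):
# write_hudi_table(df, "s3://example-bucket/poc_hudi/table_p5", hudi_options_insert)
```

With "hoodie.metadata.enable" set to "true", the write should also produce the .hoodie/metadata directory under the table's base path, as described in the question.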

Thanks a mil.

Oscar Drai

1 Answer

Yes, the metadata table can be queried. Just do:

spark.read.format("hudi").load("path-to-hudi-table/.hoodie/metadata")

For your information, the metadata table is itself a Hudi table. It is set up as merge-on-read and is based on HFile instead of Parquet/Avro logs.
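
A slightly fuller sketch of the same read, in case it helps. The helper names `metadata_table_path` and `load_metadata_table` and the example base path are made up for illustration; the only Hudi-specific part is the `.hoodie/metadata` suffix already mentioned above.

```python
def metadata_table_path(base_path):
    """Return the metadata table location for a Hudi table's base path."""
    # Strip any trailing slash so the result has no double separator.
    return base_path.rstrip("/") + "/.hoodie/metadata"


def load_metadata_table(spark, base_path):
    """Load the metadata table as a DataFrame.

    `spark` is expected to be a SparkSession with the Hudi bundle
    available (assumption).
    """
    return spark.read.format("hudi").load(metadata_table_path(base_path))


# Hypothetical usage (bucket name is a placeholder):
# metadata_df = load_metadata_table(spark, "s3://example-bucket/poc_hudi/table_p5")
# metadata_df.printSchema()
# metadata_df.show(truncate=False)
```

Calling `printSchema()` first is a cheap way to see which fields your Hudi version exposes before running any queries against it.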

parisni