
I’m looking into several “transactional data lake” technologies such as Apache Hudi, Delta Lake, and AWS Lake Formation Governed Tables.

Except for the latter, I can’t see how these would work in a multi-cluster environment. I’m baselining against S3 for storage, and I want to incrementally alter my data lake, where many clusters may be reading from and writing to the lake at any given time. Is this possible/supported? It seems like the compaction and transaction processes run on-cluster, so you cannot manage a transactional data lake with these platforms from multiple disparate sources. Or am I mistaken?

Any anecdotes or performance limitations you’ve found would be appreciated!

zachd1_618
  • Are you thinking that in a multi-cluster environment, compaction and transactions would somehow ruin consistency? Traditional RDBMSs have solved these concerns for a while, and Hudi and Delta Lake are now taking a similar approach. – Golammott Dec 13 '21 at 22:39
  • It could work, but only with a centralized metastore, which is what I’m looking for information on. – zachd1_618 Dec 15 '21 at 01:32

1 Answer


You can enable multi-writer support in Apache Hudi and then configure a lock provider, as described here: https://hudi.apache.org/docs/concurrency_control#enabling-multi-writing

Example using the AWS DynamoDB lock provider (per the linked docs you also need to switch on optimistic concurrency and lazy cleaning of failed writes; the values below are placeholders):

hoodie.write.concurrency.mode=optimistic_concurrency_control
hoodie.cleaner.policy.failed.writes=LAZY
hoodie.write.lock.provider=org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider
hoodie.write.lock.dynamodb.table=<your-lock-table>
hoodie.write.lock.dynamodb.partition_key=<your-table-name>
hoodie.write.lock.dynamodb.region=<your-aws-region>
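
If it helps, here is a minimal PySpark sketch of one of many concurrent writer jobs using these options. The table name, record key, precombine field, S3 path, and DynamoDB details are placeholder assumptions, and you would need the Hudi Spark bundle plus the hudi-aws module (which contains the DynamoDB lock provider) on the classpath:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-multi-writer").getOrCreate()

# Placeholder input; each cluster could be ingesting a different source.
df = spark.read.json("s3://my-bucket/incoming/")

hudi_options = {
    "hoodie.table.name": "my_table",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "ts",
    # Multi-writer settings mirroring the config above.
    "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
    "hoodie.cleaner.policy.failed.writes": "LAZY",
    "hoodie.write.lock.provider": "org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider",
    "hoodie.write.lock.dynamodb.table": "hudi_locks",
    "hoodie.write.lock.dynamodb.partition_key": "my_table",
    "hoodie.write.lock.dynamodb.region": "us-east-1",
}

# Any number of clusters can run a write like this concurrently; the
# DynamoDB lock provider serializes commits to the same table.
(df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://my-bucket/lake/my_table/"))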

Delta Lake has a warning in the documentation that multiple writers may result in data loss: https://docs.delta.io/latest/delta-storage.html#amazon-s3

Concurrent writes to the same Delta table from multiple Spark drivers can lead to data loss.

There is also a blog post you may find interesting that discusses common pitfalls in lakehouse concurrency control.

Kyle Weller