
I’m looking into several “transactional data lake” technologies such as Apache Hudi, Delta Lake, and AWS Lake Formation Governed Tables.

Except for the latter, I can’t see how these would work in a multi-cluster environment. I’m baselining against S3 for storage, and I want to incrementally alter my data lake, where many clusters may be reading from and writing to the lake at any given time. Is this possible/supported? It seems like the compaction and transaction processes run on-cluster, so you cannot manage a transactional data lake with these platforms from multiple disparate sources. Or am I mistaken?

Any anecdotes or performance limitations you’ve found would be appreciated!

zachd1_618
  • Are you thinking that in a multi-cluster environment, compaction and transactions would somehow ruin consistency? Traditional RDBMSs have solved these concerns for a while, and Hudi and Delta Lake are now taking a similar approach. – Golammott Dec 13 '21 at 22:39
  • It could work, but only with a centralized metastore, which is what I’m looking for information on. – zachd1_618 Dec 15 '21 at 01:32

1 Answer


You can enable multi-writer support in Apache Hudi and then configure a lock provider, as described here: https://hudi.apache.org/docs/concurrency_control#enabling-multi-writing

Example using the AWS DynamoDB lock provider (per the linked docs you also need to switch on optimistic concurrency and lazy cleaning of failed writes; the values below are placeholders):

hoodie.write.concurrency.mode=optimistic_concurrency_control
hoodie.cleaner.policy.failed.writes=LAZY
hoodie.write.lock.provider=org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider
hoodie.write.lock.dynamodb.table=<your-lock-table>
hoodie.write.lock.dynamodb.partition_key=<your-table-name>
hoodie.write.lock.dynamodb.region=<your-aws-region>
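
If it helps, here is a minimal PySpark sketch of one of many concurrent writer jobs using these options. The table name, record key, precombine field, S3 path, and DynamoDB details are placeholder assumptions, and you would need the Hudi Spark bundle plus the hudi-aws module (which contains the DynamoDB lock provider) on the classpath:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-multi-writer").getOrCreate()

# Placeholder input; each cluster could be ingesting a different source.
df = spark.read.json("s3://my-bucket/incoming/")

hudi_options = {
    "hoodie.table.name": "my_table",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "ts",
    # Multi-writer settings mirroring the config above.
    "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
    "hoodie.cleaner.policy.failed.writes": "LAZY",
    "hoodie.write.lock.provider": "org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider",
    "hoodie.write.lock.dynamodb.table": "hudi_locks",
    "hoodie.write.lock.dynamodb.partition_key": "my_table",
    "hoodie.write.lock.dynamodb.region": "us-east-1",
}

# Any number of clusters can run a write like this concurrently; the
# DynamoDB lock provider serializes commits to the same table.
(df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://my-bucket/lake/my_table/"))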

Delta Lake has a warning in the documentation that multiple writers may result in data loss: https://docs.delta.io/latest/delta-storage.html#amazon-s3

Concurrent writes to the same Delta table from multiple Spark drivers can lead to data loss.

There is also a blog post you may find interesting that discusses common pitfalls in lakehouse concurrency control.

Kyle Weller