
I am trying to replicate the SQL-database-style feature of maintaining primary keys with Databricks Delta, where the data is written to blob storage such as ADLS Gen2 or AWS S3.

I want an auto-incremented primary key feature using Databricks Delta.

The existing approach reads the latest row count and uses it to assign the next primary keys (see the sketch below). However, this approach does not work in a parallel processing environment, where concurrent writers end up assigning duplicate primary keys.
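A minimal sketch of that row-count pattern, assuming a hypothetical Delta path `/mnt/datalake/my_table`, an input DataFrame `new_rows`, and the `spark` session a Databricks notebook provides:

```python
# Hypothetical sketch of the row-count approach: read the current count,
# then offset new row numbers by it. Two jobs running this concurrently
# read the same count and therefore assign the same keys.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

target_path = "/mnt/datalake/my_table"  # hypothetical Delta location

existing_count = spark.read.format("delta").load(target_path).count()

keyed = new_rows.withColumn(
    "pk",
    # row_number over a global window also pulls everything to one partition
    F.row_number().over(Window.orderBy(F.lit(1))) + F.lit(existing_count),
)
keyed.write.format("delta").mode("append").save(target_path)
```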

mn0102
  • Possible duplicate of [Primary keys with Apache Spark](https://stackoverflow.com/questions/33102727/primary-keys-with-apache-spark) – simon_dmorias Aug 27 '19 at 14:50
  • I've flagged as duplicate. This isn't a Databricks Delta issue - rather a Spark in general issue. Ideally I would not use an incremental key - they don't work in a distributed world. Instead try a guid - or look at a function called monotonicallyIncreasingId. – simon_dmorias Aug 27 '19 at 14:52
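As the comments above suggest, a random GUID or `monotonically_increasing_id` sidesteps the collision problem, at the cost of gapless sequential keys. A PySpark sketch, reusing the hypothetical `new_rows` DataFrame:

```python
from pyspark.sql import functions as F

# uuid() yields a random, globally unique string key
with_guid = new_rows.withColumn("pk", F.expr("uuid()"))

# monotonically_increasing_id() yields 64-bit IDs that are unique within
# this DataFrame, but not consecutive and not coordinated across writes
with_mono = new_rows.withColumn("pk", F.monotonically_increasing_id())
```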

1 Answer


Creating an identity column on your table is the way to solve this problem. Identity columns are now GA (Generally Available) in Databricks Runtime 10.4+ and in Databricks SQL 2022.17+.
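A minimal sketch of the DDL, run from a notebook cell; the table and column names (`events`, `id`) are illustrative:

```python
# Delta assigns `id` at insert time: values are guaranteed unique,
# though not necessarily consecutive.
spark.sql("""
    CREATE TABLE IF NOT EXISTS events (
      id BIGINT GENERATED ALWAYS AS IDENTITY (START WITH 1 INCREMENT BY 1),
      payload STRING
    ) USING DELTA
""")

# `id` is omitted from the insert; Delta generates it
spark.sql("INSERT INTO events (payload) VALUES ('first'), ('second')")
```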

Databricks blog post introducing identity columns: https://www.databricks.com/blog/2022/08/08/identity-columns-to-generate-surrogate-keys-are-now-available-in-a-lakehouse-near-you.html