How to add a new column to a Delta Lake table?

Question

I'm trying to add a new column to data stored as a Delta Table in Azure Blob Storage. Most of the actions being done on the data are upserts, with many updates and few new inserts. My code to write data currently looks like this:

DeltaTable.forPath(spark, deltaPath)
      .as("dest_table")
      .merge(myDF.as("source_table"),
             "dest_table.id = source_table.id")
      .whenNotMatched()
      .insertAll()
      .whenMatched(upsertCond)
      .updateExpr(upsertStat)
      .execute()

From these docs, it looks like Delta Lake supports adding new columns on insertAll() and updateAll() calls only. However, I'm updating only when certain conditions are met and want the new column added to all the existing data (with a default value of null).

I've come up with a solution that seems extremely clunky and am wondering if there's a more elegant approach. Here's my current proposed solution:

// Read in existing data
val myData = spark.read.format("delta").load(deltaPath)
// Register table with Hive metastore
myData.write.format("delta").saveAsTable("input_data")

// Add new column
spark.sql("ALTER TABLE input_data ADD COLUMNS (new_col string)")

// Save as DataFrame and overwrite data on disk
val sqlDF = spark.sql("SELECT * FROM input_data")
sqlDF.write.format("delta").option("mergeSchema", "true").mode("overwrite").save(deltaPath)

Use jdbc not spark. This is not for that. – Lamanus, Aug 22 '20 at 00:05
Did you get any solution? – sp_user123, Oct 09 '20 at 14:57

score 12 · Accepted Answer · edited Oct 11 '20 at 15:34

Alter your Delta table first, and then do your merge operation:

from pyspark.sql.functions import lit

spark.read.format("delta").load('/mnt/delta/cov')\
  .withColumn("Recovered", lit(''))\
  .write\
  .format("delta")\
  .mode("overwrite")\
  .option("overwriteSchema", "true")\
  .save('/mnt/delta/cov')
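
The question asked for a default of null, whereas lit('') above writes empty strings. A minimal sketch of the same overwrite with a typed null, assuming the new column should be a string; the helper name add_null_column is hypothetical, and like the snippet above it reads and rewrites the whole table:

```python
def add_null_column(spark, path, column, data_type="string"):
    """Rewrite the Delta table at `path` with an extra all-null column.

    Hypothetical helper, not part of the Delta Lake API.
    Returns False if the column already exists, True after a rewrite.
    """
    df = spark.read.format("delta").load(path)
    if column in df.columns:
        return False  # nothing to do, the column is already there

    # Imported lazily, only when a rewrite is actually needed.
    from pyspark.sql.functions import lit

    (df.withColumn(column, lit(None).cast(data_type))  # typed null, not ''
       .write.format("delta")
       .mode("overwrite")
       .option("overwriteSchema", "true")
       .save(path))
    return True
```

For example, add_null_column(spark, '/mnt/delta/cov', 'Recovered') would add Recovered as a nullable string column, after which the merge can run against the new schema.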

answered Oct 10 '20 at 14:57 by ashok gupta · edited Oct 11 '20 at 15:34 by Adrian Mole

Thanks! This is what I ended up doing. However, this also works (at least in Databricks on Azure): ALTER TABLE delta.`wasbs://my-table@azureaccount.blob.core.windows.net/` ADD COLUMNS (mycol STRING); – Comrade_Question, Oct 23 '20 at 16:39
But this way we are not doing schema evolution :-( – Christian Herrera Jiménez, Feb 12 '21 at 23:12
That's not a merge operation, that's an overwrite operation. Have a look at the Delta docs on how to merge: https://docs.delta.io/0.4.0/api/python/index.html – Max, Sep 08 '22 at 03:31
You're not using schema evolution; are you loading and rewriting the data? Also, how does this affect the history? – Brian, Oct 20 '22 at 18:54

score 10 · Answer 2 · answered Sep 26 '22 at 19:09

New columns can also be added with SQL commands as follows:

ALTER TABLE dbName.TableName ADD COLUMNS (newColumnName dataType);

UPDATE dbName.TableName SET newColumnName = val;
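
For a table that is addressed by storage path rather than a metastore name, the same statement can be issued from PySpark using the delta.`path` form mentioned in the comments on the accepted answer. A small sketch, with hypothetical helper names (add_columns_sql, delta_path_ref) that only build the SQL string:

```python
def delta_path_ref(path):
    """Reference a Delta table by its storage path, e.g. delta.`/mnt/delta/cov`."""
    return f"delta.`{path}`"


def add_columns_sql(table, columns):
    """Build an ALTER TABLE ... ADD COLUMNS statement.

    `table` is a metastore name (dbName.TableName) or a delta_path_ref(...).
    `columns` maps each new column name to its SQL data type.
    """
    if not columns:
        raise ValueError("at least one column is required")
    cols = ", ".join(f"{name} {dtype}" for name, dtype in columns.items())
    return f"ALTER TABLE {table} ADD COLUMNS ({cols})"
```

Usage would look like spark.sql(add_columns_sql(delta_path_ref("/mnt/delta/cov"), {"new_col": "STRING"})). As far as I know, ADD COLUMNS only changes the table metadata and existing rows then read the new column as null, so the UPDATE is only needed when a non-null value is wanted.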

answered Sep 26 '22 at 19:09 by John Stud

Ignacio Alorre · Answer 3 · 2023-01-03T10:46:08.567

This is the approach that worked for me using Scala.

Given a Delta table named original_table, whose path is:

val path_to_delta = "/mnt/my/path"

This table currently has 1M records with the following schema: pk, field1, field2, field3, field4.

I want to add a new field, named newfield, to the existing schema without losing the data already stored in original_table.

So I first defined a simple schema containing just pk and newfield:

case class new_schema(
  pk: String,
  newfield: String
)

I created a dummy record using that schema:

import spark.implicits._
val dummy_record = Seq(new new_schema("delete_later", null)).toDF

I appended this dummy record with mergeSchema enabled, which adds newfield to the table schema (the existing 1M records will have newfield populated as null). Then I deleted the dummy record from the original table:

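import io.delta.tables.DeltaTable

// Append the dummy record; mergeSchema adds newfield to the table schema
dummy_record
  .write
  .format("delta")
  .option("mergeSchema", "true")
  .mode("append")
  .save(path_to_delta)

// Remove the dummy record again
val original_dt: DeltaTable = DeltaTable.forPath(spark, path_to_delta)
original_dt.delete("pk = 'delete_later'")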

Now the original table has 6 fields: pk, field1, field2, field3, field4 and newfield.

Finally, I upsert the newfield values into the corresponding 1M records, using pk as the join key:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Placeholder: you bring the new data from somewhere...
val df_with_new_field: DataFrame = ???

original_dt
  .as("original")
  .merge(
    df_with_new_field.as("new"),
    "original.pk = new.pk")
  .whenMatched
  .update(Map(
    "newfield" -> col("new.newfield")
  ))
  .execute()
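
For readers working in PySpark, as in the accepted answer, the final upsert above might be sketched as follows. The function name backfill_column is hypothetical; it assumes the column has already been added to the table and that delta_table is a delta.tables.DeltaTable:

```python
def backfill_column(delta_table, updates_df, key, column):
    """Copy `column` from `updates_df` into the matching rows of `delta_table`.

    Hypothetical helper: `delta_table` is expected to be a
    delta.tables.DeltaTable and `updates_df` a DataFrame that
    contains both `key` and `column`.
    """
    (delta_table.alias("original")
        .merge(updates_df.alias("new"), f"original.{key} = new.{key}")
        # Only the new column is set; other columns keep their values.
        .whenMatchedUpdate(set={column: f"new.{column}"})
        .execute())
```

A call such as backfill_column(DeltaTable.forPath(spark, "/mnt/my/path"), df_with_new_field, "pk", "newfield") would correspond to the Scala merge in this answer.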

https://www.databricks.com/blog/2019/09/24/diving-into-delta-lake-schema-enforcement-evolution.html

score -2 · Answer 4 · answered Sep 01 '20 at 09:17

Have you tried using the merge statement?

https://docs.databricks.com/spark/latest/spark-sql/language-manual/merge-into.html

answered Sep 01 '20 at 09:17 by Ekapol Uppapansettee

When I tested MERGE INTO, it did not add the new column. – Brian, Dec 23 '21 at 03:38
