Databricks' documentation on UDFs only shows very simple examples, e.g. an integer transformation that takes integers as parameters (https://docs.databricks.com/spark/latest/spark-sql/udf-python.html), but it says nothing about passing Delta Live Tables as parameters.
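For reference, the documented pattern only covers plain scalar values in and out, along these lines:

def squared(s):
    return s * s

spark.udf.register("squaredWithPython", squared)
spark.sql("SELECT id, squaredWithPython(id) FROM range(5)").show()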
In my case, two DLT tables are created, and I then need to perform some transformations on them that are only possible with pandas; the final result should be a DLT table as well.
I need something like this:
import dlt

@dlt.table
def dlt1():
    return spark.sql("query...")

@dlt.table
def dlt2():
    return spark.sql("query...")

@dlt.table
def dlt3():
    return spark.sql("SELECT * FROM search(LIVE.dlt1, LIVE.dlt2)")

def search(dlt1, dlt2):
    # distinct non-null values of someColumn in dlt2, collected via pandas
    dlt2_not_null = (
        dlt2.filter(dlt2["someColumn"].isNotNull())
            .select("someColumn")
            .distinct()
            .toPandas()["someColumn"]
            .tolist()
    )
    # keep only the rows of dlt1 whose someColumn is not among those values
    return dlt1.filter(~dlt1["someColumn"].isin(dlt2_not_null))

spark.udf.register("search", search)
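To make the intent of search concrete: outside of DLT, the same logic works fine as an ordinary Python function on two DataFrames. A minimal, runnable sketch (the data and column name here are made up purely for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([(1,), (2,), (3,)], ["someColumn"])
df2 = spark.createDataFrame([(2,), (None,)], ["someColumn"])

def search(df1, df2):
    # distinct non-null values of someColumn in df2, collected via pandas
    not_null = (
        df2.filter(df2["someColumn"].isNotNull())
           .select("someColumn").distinct()
           .toPandas()["someColumn"].tolist()
    )
    # rows of df1 whose someColumn is not in that set
    return df1.filter(~df1["someColumn"].isin(not_null))

search(df1, df2).show()  # keeps the rows with someColumn = 1 and 3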