If you have a very large DataFrame and/or your date arrays vary greatly in size, using Spark SQL functions to explode, calculate, and re-collect the values may be more performant than a UDF.
Setup:
import datetime
df = spark.createDataFrame([
    {
        "ID": "X",
        "DATES": [
            datetime.date(2001, 1, 1),
            datetime.date(2001, 1, 3),
            datetime.date(2001, 1, 10),
            datetime.date(2001, 1, 25)
        ]
    },
    {
        "ID": "Y",
        "DATES": [
            datetime.date(2003, 1, 1),
            datetime.date(2006, 3, 2),
            datetime.date(2007, 5, 1),
            datetime.date(2009, 8, 2)
        ]
    }
])
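For comparison, the UDF route mentioned at the top might look roughly like the sketch below; the name intervals_udf is just illustrative, and it assumes each DATES array is already sorted, as in the setup data:

import pyspark.sql.functions as F
from pyspark.sql import types as T

# Hypothetical UDF equivalent: day gaps between consecutive dates, computed in Python.
# Assumes the DATES array is already sorted.
intervals_udf = F.udf(
    lambda dates: [(b - a).days for a, b in zip(dates, dates[1:])],
    T.ArrayType(T.IntegerType())
)

df_udf = df.withColumn("INTERVALS", intervals_udf("DATES"))

Because the UDF sees the whole array at once it sidesteps the ordering concern below, but it pays the usual cost of serializing every row out to Python.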
The most complex part of the Spark SQL approach is ensuring that your dates and the resulting intervals remain in order throughout, which is accomplished below with a window function:
import pyspark.sql.functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("ID").orderBy("DATE")
df_intervals = (
    df
    # One row per (ID, DATE) so the window can order the dates.
    .select("ID", F.explode("DATES").alias("DATE"))
    # Days since the previous date (null for the first); computing the lag in its own column keeps the window expressions un-nested.
    .withColumn("INTERVAL", F.datediff("DATE", F.lag("DATE").over(w)))
    # Running, ordered lists of dates and intervals (collect_list drops the leading null).
    .withColumn("DATES", F.collect_list("DATE").over(w))
    .withColumn("INTERVALS", F.collect_list("INTERVAL").over(w))
    # Each intermediate list is a prefix of the final one, so max() returns the full list.
    .groupBy("ID")
    .agg(F.max("DATES").alias("DATES"), F.max("INTERVALS").alias("INTERVALS"))
)
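To confirm that the ordering held up, inspect the result; working the setup dates out by hand, ID "X" should come back with intervals of [2, 7, 15]:

df_intervals.show(truncate=False)
# Expected for ID "X": DATES in their original order and INTERVALS = [2, 7, 15];
# ID "Y" likewise gets one interval per consecutive pair of dates.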