4

When I try to apply a UDF over a Window function in PySpark, I run into the error: "AnalysisException: Expression not supported within a window function"

For eg:

sp = SparkSession(sc)
data = [(1,5,6,), (1,6,2,), (1,7,4,), (1,5,3,), (1,6,1), (1,7,5,), (2,2,5,), (2,3,3), (2,4,2,), (2,2,1,), (2,3,6,), (2,4,4,)]
df = sp.createDataFrame(data, ["acc", "val", "date"])

from pyspark.sql.window import Window
w = (Window.partitionBy(df.acc).orderBy(df.date).rangeBetween(-3,0))

def perform_some_operation(x):
    return sum(x)
perform_some_operation_udf = udf(perform_some_operation, DoubleType())

df = df.withColumn('udf_of_val_over_w', perform_some_operation_udf(df['val']).over(w))

throws an Analysis Exception because of the udf... But when I substitue the udf with a function from pyspark.sql.functions, it works:

from pyspark.sql import functions as sqlf
df = df.withColumn('udf_of_val_over_w', sqlf.avg(df['val']).over(w))

Any thoughts on how I can use a UDF in this situation? And Thanks in Advance!!

krzna
  • 185
  • 2
  • 9

0 Answers0