
I am using PySpark 2.4.2, so per the docs for this version one can create a GROUPED_MAP pandas UDF like this:

from pyspark.sql.functions import pandas_udf, PandasUDFType

df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], ("id", "v"))

# Subtract each group's mean from its values
@pandas_udf(returnType="id long, v double", functionType=PandasUDFType.GROUPED_MAP)
def subtract_mean(pdf):
    v = pdf.v
    return pdf.assign(v=v - v.mean())

df.groupby("id").apply(subtract_mean).show()

This works, but you cannot call subtract_mean as a normal Python function on a pandas DataFrame. If you instead write it like this, that does work:

def subtract_mean(pdf):
    v = pdf.v
    return pdf.assign(v=v - v.mean())

# Wrap the plain function explicitly instead of decorating it
sub_spark = pandas_udf(f=subtract_mean, returnType="id long, v double", functionType=PandasUDFType.GROUPED_MAP)

df.groupby("id").apply(sub_spark).show()

Now subtract_mean can be called from plain Python with a pandas DataFrame. How does one achieve this using the decorator approach? It is not clear from the docs. Which function is decorated, and which function is passed as the f parameter?

mathfish
  • The two ways are equivalent for specifying a UDF. The decorator approach is just a neater way of doing things. The function that follows the decorator is passed as the `f` parameter. If you decorated a function, I don't think you can access the original, undecorated function. – mck Jan 08 '21 at 12:47
  • I was afraid of that. Humpf – mathfish Jan 08 '21 at 13:01
  • There is also a way to get back the original function, as described here: https://stackoverflow.com/a/33024739/14165730 . perhaps `subtract_mean.__wrapped__` will give you back the original undecorated function. – mck Jan 08 '21 at 13:05
  • Yes! That totally works: `pandas_dataframe.groupby("id").apply(subtract_mean.__wrapped__)` – mathfish Jan 08 '21 at 13:11
  • I would suggest using the second approach in your question though. Using `__wrapped__` makes the code less readable. – mck Jan 08 '21 at 13:12
  • Yes, I agree but I only need this for ease of unit testing the pandas udfs without spark. Thanks again. – mathfish Jan 08 '21 at 13:14
  • Feel free to formalize this in an answer and I'll accept it. – mathfish Jan 08 '21 at 13:15

1 Answer


The two ways of specifying a UDF are equivalent. The decorator approach is just a neater way of doing the same thing: the function that follows the decorator is passed as the f parameter.

As described in this answer, you can use subtract_mean.__wrapped__ to get back the original undecorated function. The second approach in your question is more readable, though, since __wrapped__ makes the code harder to follow at a glance. But if it's just for unit testing purposes, it should be fine, as in the sketch below.
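
For example, a minimal sketch of a Spark-free unit test (the pandas DataFrame here just mirrors the sample data from the question):

import pandas as pd

# Plain pandas DataFrame -- no Spark session needed
pdf = pd.DataFrame({"id": [1, 1, 2, 2, 2], "v": [1.0, 2.0, 3.0, 5.0, 10.0]})

# __wrapped__ exposes the original, undecorated function,
# so it can be called like any ordinary Python function
result = pdf.groupby("id").apply(subtract_mean.__wrapped__)
print(result)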

mck