3

From the PySpark docs, I can do:

gdf = df.groupBy(df.name)
sorted(gdf.agg({"*": "first"}).collect())

In my actual use case I have many variables, so I like that I can simply create a dictionary, which is why @lemon's suggestion below won't work for me:

gdf = df.groupBy(df.name)
sorted(gdf.agg(F.first(col, ignorenulls=True)).collect())

How can I pass a parameter to first (i.e. ignorenulls=True)? See here.

safex

2 Answers

4

You can use a list comprehension:

gdf.agg(*[F.first(x, ignorenulls=True).alias(x) for x in df.columns]).collect()
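
To make this concrete, here is a minimal, self-contained sketch of the same approach on a toy DataFrame (the data, schema, and column names below are made up for illustration; only the list-comprehension line comes from the answer):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Toy data: the grouping column plus two value columns that contain nulls.
df = spark.createDataFrame(
    [("a", None, 1), ("a", 10, None), ("b", 20, 2)],
    schema="name string, x int, y int",
)

# One first(..., ignorenulls=True) expression per non-grouping column,
# keeping the original column names via alias.
exprs = [F.first(c, ignorenulls=True).alias(c) for c in df.columns if c != "name"]

df.groupBy("name").agg(*exprs).show()
# For group "a" this returns x=10 and y=1, skipping the nulls.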
Emma
2

Try calling the pyspark function directly:

import pyspark.sql.functions as F

gdf = df.groupBy(df.name)

parameters = {'col': '<your_column_name>', 'ignorenulls': True}
sorted(gdf.agg(F.first(**parameters)).collect())

Does it work for you?

P.S. ignorenulls is False by default, so it does need to be set explicitly.

lemon
  • This works, yes, but I have a large list of variables, so I actually want the dict way; I will clarify this in my question. – safex Apr 08 '22 at 20:02
  • @safex I've updated the answer with the required generalization over the function's parameter values. If you also need to generalize over the name of the function, you can do it with a call by name; for that, take a look here: https://stackoverflow.com/questions/3061/calling-a-function-of-a-module-by-using-its-name-a-string (a sketch of this follows below). – lemon Apr 08 '22 at 20:09
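
Following up on the call-by-name idea in the last comment, here is a hedged sketch that combines the keyword-argument dict with a per-column function lookup via getattr. It reuses df and the name column from the question; agg_spec and the other names are illustrative, not from either answer:

import pyspark.sql.functions as F

# One (function name, keyword arguments) pair per non-grouping column;
# here every column is aggregated with first(..., ignorenulls=True).
agg_spec = {c: ("first", {"ignorenulls": True}) for c in df.columns if c != "name"}

# Look the aggregate function up on pyspark.sql.functions by name,
# unpack its keyword arguments, and keep the original column names.
exprs = [
    getattr(F, func_name)(c, **kwargs).alias(c)
    for c, (func_name, kwargs) in agg_spec.items()
]

gdf = df.groupBy(df.name)
sorted(gdf.agg(*exprs).collect())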