3

From the PySpark docs, I can do:

gdf = df.groupBy(df.name)
sorted(gdf.agg({"*": "first"}).collect())

In my actual use case I have many variables, so I like that I can simply create a dictionary, which is why @lemon's suggestion below won't work for me:

gdf = df.groupBy(df.name)
sorted(gdf.agg(F.first(col, ignorenulls=True)).collect())

How can I pass a parameter to first (i.e. ignorenulls=True)? See here.

safex

2 Answers

4

You can use a list comprehension:

gdf.agg(*[F.first(x, ignorenulls=True).alias(x) for x in df.columns]).collect()
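
To make this concrete, here is a minimal, self-contained sketch of the same approach on a toy DataFrame (the data, schema, and column names below are made up for illustration; only the list-comprehension line comes from the answer):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Toy data: the grouping column plus two value columns that contain nulls.
df = spark.createDataFrame(
    [("a", None, 1), ("a", 10, None), ("b", 20, 2)],
    schema="name string, x int, y int",
)

# One first(..., ignorenulls=True) expression per non-grouping column,
# keeping the original column names via alias.
exprs = [F.first(c, ignorenulls=True).alias(c) for c in df.columns if c != "name"]

df.groupBy("name").agg(*exprs).show()
# For group "a" this returns x=10 and y=1, skipping the nulls.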
Emma
2

Try calling the pyspark function directly:

import pyspark.sql.functions as F

gdf = df.groupBy(df.name)

parameters = {'col': '<your_column_name>', 'ignorenulls': True}
sorted(gdf.agg(F.first(**parameters)).collect())

Does it work for you?

P.S. ignorenulls is False by default, so it does need to be set explicitly.

lemon
  • This works, yes, but I have a large list of variables, so I actually want the dict way; I will clarify this in my question. – safex Apr 08 '22 at 20:02
  • @safex I've updated the answer with the required generalization over the function's parameter values. If you also need to generalize over the name of the function, you can do it with a call by name; for that, take a look here: https://stackoverflow.com/questions/3061/calling-a-function-of-a-module-by-using-its-name-a-string (a sketch of this follows below). – lemon Apr 08 '22 at 20:09
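
Following up on the call-by-name idea in the last comment, here is a hedged sketch that combines the keyword-argument dict with a per-column function lookup via getattr. It reuses df and the name column from the question; agg_spec and the other names are illustrative, not from either answer:

import pyspark.sql.functions as F

# One (function name, keyword arguments) pair per non-grouping column;
# here every column is aggregated with first(..., ignorenulls=True).
agg_spec = {c: ("first", {"ignorenulls": True}) for c in df.columns if c != "name"}

# Look the aggregate function up on pyspark.sql.functions by name,
# unpack its keyword arguments, and keep the original column names.
exprs = [
    getattr(F, func_name)(c, **kwargs).alias(c)
    for c, (func_name, kwargs) in agg_spec.items()
]

gdf = df.groupBy(df.name)
sorted(gdf.agg(*exprs).collect())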