From the PySpark docs, I can do:
gdf = df.groupBy(df.name)
sorted(gdf.agg({"*": "first"}).collect())
In my actual use case I have many variables, so I like that I can simply build a dictionary, which is why @lemon's suggestion:
gdf = df.groupBy(df.name)
sorted(gdf.agg(F.first(col, ignorenulls=True)).collect())
won't work for me.
How can I pass a parameter (i.e. ignorenulls=True) to first while keeping the dictionary approach? See here.