How can I get the first non-null value for each column in a group by? I tried combining first with coalesce, F.first(F.coalesce("code")), but I don't get the desired behavior: I seem to get the value from the first row of each group, even when it is null (coalesce with a single column argument is a no-op, so first just takes whatever is in the first row).
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import functions as F

sc = SparkContext("local")
sqlContext = SQLContext(sc)

df = sqlContext.createDataFrame([
    ("a", None, None),
    ("a", "code1", None),
    ("a", "code2", "name2"),
], ["id", "code", "name"])
I tried:
(df
    .groupby("id")
    .agg(F.first(F.coalesce("code")),
         F.first(F.coalesce("name")))
    .collect())
Desired output:
[Row(id='a', code='code1', name='name2')]
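
For reference, a minimal sketch of one approach that seems to produce this output: pyspark.sql.functions.first accepts an ignorenulls flag, so dropping the coalesce and passing ignorenulls=True should return the first non-null value per group (this assumes a Spark version where first supports the ignorenulls argument; the .alias calls are only there to keep the output column names tidy):

# Sketch: first non-null value per group, using ignorenulls
# instead of coalesce (assumes F.first supports ignorenulls).
(df
    .groupby("id")
    .agg(F.first("code", ignorenulls=True).alias("code"),
         F.first("name", ignorenulls=True).alias("name"))
    .collect())
# For this example data: [Row(id='a', code='code1', name='name2')]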