
How can I get the first non-null values from a group by? I tried using first with coalesce, F.first(F.coalesce("code")), but I don't get the desired behavior (I seem to get the first row, nulls included).

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import functions as F

sc = SparkContext("local")

sqlContext = SQLContext(sc)

df = sqlContext.createDataFrame([
    ("a", None, None),
    ("a", "code1", None),
    ("a", "code2", "name2"),
], ["id", "code", "name"])

I tried:

(df
  .groupby("id")
  .agg(F.first(F.coalesce("code")),
       F.first(F.coalesce("name")))
  .collect())

DESIRED OUTPUT

[Row(id='a', code='code1', name='name2')]
Kamil Sindi

3 Answers


For Spark 1.3 - 1.5, this could do the trick:

from pyspark.sql import functions as F
df.groupBy(df['id']).agg(F.first(df['code']), F.first(df['name'])).show()

+---+-----------+-----------+
| id|FIRST(code)|FIRST(name)|
+---+-----------+-----------+
|  a|      code1|      name2|
+---+-----------+-----------+

Edit

Apparently, in version 1.6 the way the first aggregate function is processed was changed. Now, the underlying class First is constructed with a second argument, the ignoreNullsExpr parameter, which is not yet used by the first aggregate function (as can be seen here). However, in Spark 2.0 it will be possible to call agg(F.first(col, True)) to ignore nulls (as can be checked here).
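As a sketch of that call, assuming Spark 2.0+ and the same df (columns id, code, name) from the question:

from pyspark.sql import functions as F

# Spark 2.0+: the second argument to first() tells it to skip nulls
(df
  .groupby("id")
  .agg(F.first("code", True).alias("code"),
       F.first("name", True).alias("name"))
  .collect())
# should give: [Row(id='a', code='code1', name='name2')]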

Therefore, for Spark 1.6 the approach must be different and, unfortunately, a little less efficient. One idea is the following:

from pyspark.sql import functions as F
df1 = df.select('id', 'code').filter(df['code'].isNotNull()).groupBy(df['id']).agg(F.first(df['code']))
df2 = df.select('id', 'name').filter(df['name'].isNotNull()).groupBy(df['id']).agg(F.first(df['name']))
result = df1.join(df2, 'id')
result.show()

+---+-------------+-------------+
| id|first(code)()|first(name)()|
+---+-------------+-------------+
|  a|        code1|        name2|
+---+-------------+-------------+

Maybe there is a better option. I'll edit the answer if I find one.

Daniel de Paula

Because I only had one non-null value for every grouping, using min / max in 1.6 worked for my purposes:

(df
  .groupby("id")
  .agg(F.min("code"),
       F.min("name"))
  .show())

+---+---------+---------+
| id|min(code)|min(name)|
+---+---------+---------+
|  a|    code1|    name2|
+---+---------+---------+
Kamil Sindi

The first function accepts an argument ignorenulls, which can be set to True:

Python:

df.groupby("id").agg(first(col("code"), ignorenulls=True).alias("code"))

Scala:

df.groupBy("id").agg(first(col("code"), ignoreNulls = true).alias("code"))
Abdennacer Lachiheb