
I am trying to include null values in collect_list in PySpark, but the collect_list operation excludes nulls. I have looked at the following post: Pyspark - Retain null values when using collect_list. However, the answer given there is not what I am looking for.

I have a dataframe df like this.

| id | family | date       |
----------------------------
| 1  |  Prod  | null       |
| 2  |  Dev   | 2019-02-02 |
| 3  |  Prod  | 2017-03-08 |

Here's my code so far:

import pyspark.sql.functions as f

df.groupby("family").agg(f.collect_list("date").alias("entry_date"))

This gives me an output like this:

| family | entry_date   |
--------------------------
| Prod   | [2017-03-08] |
| Dev    | [2019-02-02] |

What I really want is as follows:

| family | entry_date         |
-------------------------------
| Prod   | [null, 2017-03-08] |
| Dev    | [2019-02-02]       |

Can someone please help me with this? Thank you!

Gingerbread

1 Answer


A possible workaround is to replace all null values with a placeholder string before aggregating. (Perhaps not the best way to do this, but it's a solution nonetheless.)

import pyspark.sql.functions as f

df = df.na.fill("my_null", subset=["date"])  # replace nulls in "date" with "my_null"
df = df.groupby("family").agg(f.collect_list("date").alias("entry_date"))

Should give you:

| family | entry_date            |
----------------------------------
| Prod   | [my_null, 2017-03-08] |
| Dev    | [2019-02-02]          |
Sequinex