I am trying to include null values in collect_list while using PySpark; however, the collect_list operation excludes nulls. I have looked into the following post: Pypsark - Retain null values when using collect_list. However, the answer given is not what I am looking for.
I have a dataframe df like this:
| id | family | date       |
----------------------------
| 1  | Prod   | null       |
| 2  | Dev    | 2019-02-02 |
| 3  | Prod   | 2017-03-08 |
Here's my code so far:
from pyspark.sql import functions as f
df.groupby("family").agg(f.collect_list("date").alias("entry_date"))
This gives me an output like this:
| family | entry_date |
-----------------------
| Prod   |[2017-03-08]|
| Dev    |[2019-02-02]|
What I really want is as follows:
| family | entry_date       |
-----------------------------
| Prod   |[null, 2017-03-08]|
| Dev    |[2019-02-02]      |
Can someone please help me with this? Thank you!