I have two PySpark dataframes:
1st dataframe: plants
+-----+--------+
|plant|station |
+-----+--------+
|Kech | st1 |
|Casa | st2 |
+-----+--------+
2nd dataframe: stations
+-------+--------+
|program|station |
+-------+--------+
|pr1 | null|
|pr2 | st1 |
+-------+--------+
What I want is to replace the null values in the second dataframe stations with all the values of the station column from the first dataframe, like this:
+-------+--------------+
|program|station |
+-------+--------------+
|pr1 | [st1, st2]|
|pr2 | st1 |
+-------+--------------+
I did this:
stList = plants.select(F.col('station')).rdd.map(lambda x: x[0]).collect()
stations = stations.select(
    F.col('program'),
    F.when(stations.station.isNull(), stList).otherwise(stations.station).alias('station')
)
but it gives me an error because when does not accept a Python list as a parameter.