I have an RDD that looks like this:
timestamp,user_id,search_id
[2021-08-14 14:38:31,user_a,null]
[2021-08-14 14:42:01,user_a,ABC]
[2021-08-14 14:55:12,user_a,null]
[2021-08-14 14:56:19,user_a,null]
[2021-08-14 15:01:36,user_a,null]
[2021-08-14 15:02:22,user_a,null]
[2021-08-15 07:38:07,user_b,XYZ]
[2021-08-15 07:39:59,user_b,null]
I would like to associate the events that have no search_id with the previous search, by filling each null value in "search_id" with the latest preceding non-null value (when there is one), grouped by user_id.
Therefore, my output would look like this:
timestamp,user_id,search_id
[2021-08-14 14:38:31,user_a,null]
[2021-08-14 14:42:01,user_a,ABC]
[2021-08-14 14:55:12,user_a,ABC]
[2021-08-14 14:56:19,user_a,ABC]
[2021-08-14 15:01:36,user_a,ABC]
[2021-08-14 15:02:22,user_a,ABC]
[2021-08-15 07:38:07,user_b,XYZ]
[2021-08-15 07:39:59,user_b,XYZ]
I found a solution for Spark DataFrames that uses org.apache.spark.sql.functions.last
over a window (Spark Window function last not null value), but my context doesn't allow me to convert the RDD to a DataFrame at the moment, so I was wondering if anyone had an idea of how this could be done at the RDD level.
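For what it's worth, here is a rough sketch of one possible approach I've been considering: group the records by user_id, sort each user's events by timestamp, and carry the last non-null search_id forward. The `forward_fill` helper below is plain Python (the function name and the surrounding RDD pipeline are just illustrative, not from any library), and it assumes each user's events fit in memory once grouped:

```python
def forward_fill(events):
    """Fill None search_ids with the latest preceding non-None value.

    `events` is an iterable of (timestamp, search_id) pairs for one user;
    they are sorted by timestamp before filling.
    """
    filled = []
    last_seen = None
    for ts, search_id in sorted(events, key=lambda e: e[0]):
        if search_id is not None:
            last_seen = search_id  # remember the most recent search_id
        filled.append((ts, last_seen))
    return filled

# Hypothetical use on an RDD of (timestamp, user_id, search_id) tuples:
# rdd.map(lambda r: (r[1], (r[0], r[2]))) \
#    .groupByKey() \
#    .flatMap(lambda kv: [(ts, kv[0], sid) for ts, sid in forward_fill(kv[1])])
```

I'm not sure this is the idiomatic way to do it, though, and `groupByKey` could be a problem for users with very many events.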