I have one table with a lot of type of data, and some of the data has one information that is really important to analyse the rest of the data. This is the table that I have
name |player_id|data_ms|coins|progress |
progress | 1223 | 10 | | 128 |
complete | 1223 | 11 | 154| |
win | 1223 | 9 | 111| |
progress | 1223 | 11 | | 129 |
played | 1111 | 19 | 141| |
progress | 1111 | 25 | | 225 |
This is the table that I want
name |player_id|data_ms|coins|progress |
progress | 1223 | 10 | | 128 |
complete | 1223 | 11 | 154| 128 |
win | 1223 | 9 | 111| 129 |
progress | 1223 | 11 | | 129 |
played | 1111 | 19 | 141| 225 |
progress | 1111 | 25 | | 225 |
I need to find the progress of the player, using the condition that, it has to be the first progress emitted after the data_ms (epoch unixtimstamp) of this event.
My table has 4 bilions lines of data, it's partitioned by data.
I tried to create a UDF function that should read the table filtering it, but it's not an option since you can't serialize spark to an UDF.
Any idea of how should I do this?