So let's say that I have a DataFrame that is in an event-based order: every time something happens, I get a new event that says someone changed location or job. Here is what an example input could look like:
+--------+----+----------------+---------------+
|event_id|name| job| location|
+--------+----+----------------+---------------+
| 10| Bob| Manager| |
| 9| Joe| | HQ|
| 8| Tim| |New York Office|
| 7| Joe| |New York Office|
| 6| Joe| Head Programmer| |
| 5| Bob| | LA Office|
| 4| Tim| Manager| HQ|
| 3| Bob| |New York Office|
| 2| Bob|DB Administrator| HQ|
| 1| Joe| Programmer| HQ|
+--------+----+----------------+---------------+
In this example, 10 is the newest event and 1 is the oldest. Now I want to get the newest information about each person. Here is what I would want the output to be:
+----+---------------+---------------+
|name| job| location|
+----+---------------+---------------+
| Bob| Manager| LA Office|
| Joe|Head Programmer| HQ|
| Tim| Manager|New York Office|
+----+---------------+---------------+
The current way that I do this reorganization is by collecting the data and then looping through the events from newest to oldest to find the latest information about each person. The issue with this approach is that it is extremely slow for large DataFrames, and eventually the data won't all fit within the memory of one computer. What is the proper way to do this with Spark?
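For reference, here is a minimal sketch of what I'm doing now (assuming the DataFrame is named `events` and missing values are empty strings):

```python
# Collect everything to the driver and walk from newest to oldest,
# keeping the first (i.e. most recent) non-empty value per field.
latest = {}  # name -> {"job": ..., "location": ...}

for row in sorted(events.collect(), key=lambda r: r["event_id"], reverse=True):
    info = latest.setdefault(row["name"], {"job": "", "location": ""})
    if not info["job"] and row["job"]:
        info["job"] = row["job"]
    if not info["location"] and row["location"]:
        info["location"] = row["location"]
```

This produces the output table above, but only after pulling every row onto the driver, which is exactly the bottleneck I want to avoid.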