I have a very big table of time series data with these columns:
- Timestamp
- LicensePlate
- UberRide#
- Speed
Each collection of LicensePlate/UberRide# data should be processed considering the whole set of data. In other words, I do not need to process the data row by row, but rather all rows grouped by (LicensePlate, UberRide#) together.
I am planning to use Spark with the DataFrame API, but I am confused about how to perform a custom calculation over a grouped Spark DataFrame.
What I need to do is:
- Get all the data
- Group by some columns
- For each Spark DataFrame group, apply a function f(x), returning one custom object per group (see the sketch after this list)
- Combine the per-group results by applying g(x), returning a single custom object
How can I do steps 3 and 4? Any hints on which Spark API (DataFrame, Dataset, RDD, maybe pandas...) I should use?
The whole workflow can be seen below: