My task is to aggregate data by hour and store each hourly aggregate as a row in a DB. Aggregating one hour does not require any data from other hours. The input is JSON files, and an important point is that these files are stored in separate folders – one folder per hour.
I have 2 questions:
- What is the right way to aggregate in this scenario? I'd like to "send" each hour's data to a different node (or nodes) and aggregate it separately, in parallel, so that in the end I get a dataframe that contains only the aggregated result for each hour. I understand that simple partitioning doesn't return such a dataframe. (See the first sketch after this list.)
- How can I take advantage of the separate folders – is it worth reading each hour's data separately and then combining everything with a union (while preserving the partitioning, like here)? Does that actually save the group-by operation? (See the second sketch after this list.)
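To make the first question concrete, here is a minimal sketch of the plain approach I'm comparing against, assuming PySpark, a hypothetical layout like `data/2023-01-01/<hour>/`, and hypothetical field names `hour` and `value`:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("hourly-aggregation").getOrCreate()

# Read all hours at once; the wildcard matches the per-hour subfolders.
df = spark.read.json("data/2023-01-01/*/")

# groupBy runs in parallel across executors and returns exactly one row
# per hour, i.e. a dataframe that contains only the aggregated results.
hourly = df.groupBy("hour").agg(
    F.count("*").alias("events"),
    F.sum("value").alias("total_value"),
)

hourly.show()  # in practice this would be written to the DB, one row per hour
```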
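And here is a minimal sketch of the per-folder variant from the second question: read each hour's folder on its own, aggregate it down to a single row, and union the 24 results (same hypothetical paths and field names, and the same `spark` session as above):

```python
from functools import reduce
from pyspark.sql import DataFrame
import pyspark.sql.functions as F

def aggregate_hour(hour: int) -> DataFrame:
    # Each hour's folder is read and reduced to a single row on its own,
    # so no shuffle/group-by across hours should be needed.
    path = f"data/2023-01-01/{hour:02d}/"
    return (
        spark.read.json(path)
             .agg(F.count("*").alias("events"), F.sum("value").alias("total_value"))
             .withColumn("hour", F.lit(hour))
    )

# Combine the 24 single-row dataframes into one result dataframe.
result = reduce(DataFrame.unionByName, [aggregate_hour(h) for h in range(24)])
```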