
My task is to aggregate data by hour (and store each hourly result as a row in a DB). Aggregating one hour requires no knowledge of the other hours. The input is JSON files. An important point is that these files are stored in separate folders, one folder per hour.

I have 2 questions:

  1. What is the right way to aggregate in such a scenario? I'd like to "send" each hour's data to a different node (or nodes) and aggregate it separately in parallel, so that in the end I'm left with a dataframe that contains only the aggregated result for each hour. I understand that simple partitioning alone doesn't return such a dataframe.
  2. How can I take advantage of the separate folders? Is it worth reading each hour's data separately and then combining everything with a union (while preserving the partitioning, as here)? Does that indeed save the group-by operation? (A sketch of both options follows below.)
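To make the comparison concrete, here is a minimal PySpark sketch of the two options I have in mind; the folder layout `/data/events/hour=NN/`, the presence of an `hour` field in the JSON, and the simple `count` aggregation are all placeholders for my real setup:

```python
from functools import reduce

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Option 1: read all hours at once and let Spark group by the hour column.
# Assumes each JSON record carries an "hour" field; the path is hypothetical.
df = spark.read.json("/data/events/*/")
hourly = df.groupBy("hour").agg(F.count("*").alias("events"))

# Option 2: read each hour's folder separately, aggregate it on its own,
# and union the 24 one-row results into a single dataframe.
parts = []
for h in range(24):
    part = (
        spark.read.json(f"/data/events/hour={h:02d}/")  # hypothetical per-hour folder
        .agg(F.count("*").alias("events"))
        .withColumn("hour", F.lit(h))
    )
    parts.append(part)
hourly_unioned = reduce(lambda a, b: a.unionByName(b), parts)
```

Both should end with the same per-hour result, but option 2 issues a separate read per folder and skips the shuffle of the group-by, which is exactly the trade-off I'm asking about.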
Yehezkel
  • You could partition by hour. However, Spark already parallelizes under the hood if the data are correctly partitioned. – BlueSheepToken Jul 23 '19 at 14:48
  • @BlueSheepToken thanks. My (2nd) question is whether it's possible & better to do the partitioning using the different paths of the hours. – Yehezkel Jul 23 '19 at 21:39
  • I do not think this is useful. Is it Parquet? Regular CSV? One way I see to take advantage of that is to use regular expressions on the paths to read only what you need (see the sketch after these comments)! – BlueSheepToken Jul 24 '19 at 07:37
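For reference, Spark's file reader accepts Hadoop-style glob patterns rather than full regular expressions; a minimal sketch of reading only a subset of the hour folders, using the same hypothetical layout as above:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hadoop-style globs (not full regexes) can select specific hour folders;
# the /data/events/hour=NN/ layout is an assumption.
morning = spark.read.json("/data/events/hour={06,07,08}/")
```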

0 Answers