
I have a table with StartDate and EndDate columns. I want to partition the data by month and run the algorithm on each month's partition.

Currently, I am filtering the DataFrame by date (StartDate and EndDate) and running the algorithm for each month sequentially: first January, then February, March, and so on. By running the algorithm sequentially for each month, we are not able to reap the benefits of Spark's parallelism.

I want to run the algorithm for each month (Jan, Feb, March, ...) in parallel to take advantage of Spark's parallelism.

To add more information to the question: I am running the algorithm (which has a set of steps A, B, C, D) sequentially for each month, in a loop. I want to run them concurrently.
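A minimal plain-Python sketch of the sequential pattern described above (the sample rows and the `algorithm` body are hypothetical stand-ins for the real table and steps A–D):

```python
from datetime import date

# Hypothetical sample rows; the real table has StartDate and EndDate columns.
rows = [
    {"StartDate": date(2015, 1, 5), "value": 10},
    {"StartDate": date(2015, 1, 20), "value": 4},
    {"StartDate": date(2015, 2, 3), "value": 7},
    {"StartDate": date(2015, 3, 11), "value": 1},
]

def algorithm(month_rows):
    # Placeholder for steps A, B, C, D; here it just sums the values.
    return sum(r["value"] for r in month_rows)

# Sequential: filter the data for one month at a time, then run the algorithm.
# Each month waits for the previous one to finish.
results = {}
for month in (1, 2, 3):
    month_rows = [r for r in rows if r["StartDate"].month == month]
    results[month] = algorithm(month_rows)
```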

Please advise: how do we execute the algorithm in parallel for each month?

user3061250
    Can you say more about the per month algorithm? Does it also have a parallel nature or does it run sequentially over the data for each month? – mattinbits Sep 03 '15 at 13:28
  • This should be marked as duplicate with reference to: https://stackoverflow.com/questions/30995699/how-to-define-partitioning-of-dataframe – Michel Lemay Aug 10 '17 at 13:52

1 Answer


You could simply use a groupByKey, using the month as the key for each value.
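To illustrate the shape of that approach: in Spark you would key each record by month and group, e.g. something like `rdd.keyBy(lambda r: month_of(r)).groupByKey().mapValues(algorithm)`, and Spark then processes each month's group in parallel across the cluster. The plain-Python sketch below only mimics that pattern with stdlib tools (the sample rows, `algorithm`, and thread pool are illustrative assumptions, not Spark itself):

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from datetime import date

# Hypothetical sample rows standing in for the real table.
rows = [
    {"StartDate": date(2015, 1, 5), "value": 10},
    {"StartDate": date(2015, 1, 20), "value": 4},
    {"StartDate": date(2015, 2, 3), "value": 7},
    {"StartDate": date(2015, 3, 11), "value": 1},
]

def algorithm(month_rows):
    # Placeholder for steps A, B, C, D; here it just sums the values.
    return sum(r["value"] for r in month_rows)

# Group records by month key -- the analogue of keyBy(...).groupByKey().
groups = defaultdict(list)
for r in rows:
    groups[r["StartDate"].month].append(r)

# Run the algorithm on each month's group concurrently -- the analogue of
# mapValues(algorithm), which Spark would execute in parallel per key.
with ThreadPoolExecutor() as pool:
    results = dict(zip(groups, pool.map(algorithm, groups.values())))
```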

Martijn Pieters
gprivitera
It is an answer if the question is only "how to run an algorithm in parallel on some particular data". In case he has more restrictive requirements, I asked him for more details. – gprivitera Sep 03 '15 at 15:48
To add more information to the question: I am running the algorithm (which has a set of steps A, B, C, D) sequentially for each month, in a loop. I want to run them concurrently. – user3061250 Sep 04 '15 at 02:46