
Is there any way to run multiple independent aggregation jobs on a single RDD in parallel? My first preference is Python, then Scala and Java.

The courses of action, in order of preference, are:

  1. Using a ThreadPool: run different aggregation functions on different threads. I have not seen an example that does this.

  2. Using cluster mode on YARN, submitting different jars. Is this possible, and if so, is it possible in PySpark?

  3. Using Kafka: run different spark-submit jobs against the DataFrame streaming through Kafka.

I am quite new to Spark; my experience so far is running Spark on YARN for ETL, performing multiple aggregations serially. I was wondering whether these aggregations could run in parallel, since they are mostly independent.

preitam ojha

1 Answer


Considering your broad question, here is a broad answer:

Yes, it is possible to run multiple aggregation jobs on a single DataFrame in parallel.
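
As a minimal sketch of the ThreadPool approach from the question (the input path and the columns category and amount are hypothetical placeholders): each action submitted from its own Python thread becomes a separate Spark job, and the Spark scheduler accepts jobs from multiple threads concurrently.

    from multiprocessing.pool import ThreadPool

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("parallel-aggregations").getOrCreate()

    # Hypothetical input; replace with your own source.
    # Caching avoids recomputing the DataFrame once per job.
    df = spark.read.parquet("hdfs:///path/to/data").cache()

    # Each function runs one independent aggregation;
    # the collect()/count() action triggers its own Spark job.
    def total_by_category(frame):
        return frame.groupBy("category").agg(F.sum("amount")).collect()

    def average_amount(frame):
        return frame.agg(F.avg("amount")).collect()

    def row_count(frame):
        return frame.count()

    # Submitting the actions from separate threads lets Spark
    # schedule the three jobs concurrently instead of serially.
    pool = ThreadPool(3)
    results = pool.map(lambda job: job(df),
                       [total_by_category, average_amount, row_count])
    pool.close()
    pool.join()

Note that Spark schedules jobs FIFO by default; setting spark.scheduler.mode to FAIR (for example with --conf spark.scheduler.mode=FAIR on spark-submit) lets jobs submitted from different threads share executors more evenly.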

As for the rest, it is not clear what you are asking.

eliasah
  • Thanks! Could you please indicate how to do that? I searched for the same but found only vague answers. I am sorry for not being clear about what I am asking. – preitam ojha Jun 25 '16 at 17:17
  • I'm sorry, I can't elaborate more; it's quite broad. Spark is a parallel data processing engine, and I can't give a specific answer to such a broad question. Please read up on how to ask questions on Stack Overflow; it might help you revise your question. – eliasah Jun 26 '16 at 12:29