
Is there any way to run multiple independent aggregation jobs on a single RDD in parallel? My first preference is Python, then Scala and Java.

The courses of action, in order of preference, are:

  1. Using a ThreadPool: run different aggregation functions on different threads. I have not seen an example that does this.

  2. Using cluster mode on YARN, submitting different jars. Is this possible, and if so, is it possible in PySpark?

  3. Using Kafka: run different spark-submit jobs against the DataFrame streaming through Kafka.

I am quite new to Spark; my experience so far is running Spark on YARN for ETL, performing multiple aggregations serially. I was wondering whether these aggregations could run in parallel, since they are mostly independent.

preitam ojha

1 Answer


Considering your broad question, here is a broad answer:

Yes, it is possible to run multiple aggregation jobs on a single DataFrame in parallel.
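
As a minimal sketch of the ThreadPool approach from the question (the input path and the columns category and amount are hypothetical placeholders): each action submitted from its own Python thread becomes a separate Spark job, and the Spark scheduler accepts jobs from multiple threads concurrently.

    from multiprocessing.pool import ThreadPool

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("parallel-aggregations").getOrCreate()

    # Hypothetical input; replace with your own source.
    # Caching avoids recomputing the DataFrame once per job.
    df = spark.read.parquet("hdfs:///path/to/data").cache()

    # Each function runs one independent aggregation;
    # the collect()/count() action triggers its own Spark job.
    def total_by_category(frame):
        return frame.groupBy("category").agg(F.sum("amount")).collect()

    def average_amount(frame):
        return frame.agg(F.avg("amount")).collect()

    def row_count(frame):
        return frame.count()

    # Submitting the actions from separate threads lets Spark
    # schedule the three jobs concurrently instead of serially.
    pool = ThreadPool(3)
    results = pool.map(lambda job: job(df),
                       [total_by_category, average_amount, row_count])
    pool.close()
    pool.join()

Note that Spark schedules jobs FIFO by default; setting spark.scheduler.mode to FAIR (for example with --conf spark.scheduler.mode=FAIR on spark-submit) lets jobs submitted from different threads share executors more evenly.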

As for the rest, it is not clear what you are asking.

eliasah
  • Thanks! Could you please indicate how to do that? I searched for the same but found only vague answers. I am sorry for not being clear about what I am asking. – preitam ojha Jun 25 '16 at 17:17
  • I'm sorry, I can't elaborate more; it's quite broad. Spark is a parallel data processing engine, and I can't give a specific answer to such a broad question. Please read up on how to ask questions on Stack Overflow; it might help you revise your question. – eliasah Jun 26 '16 at 12:29