
df is a DataFrame containing all the car data (| id | time | speed | gps | ... |);

trips is a list of (id, start, end) records generated from df.

method1 is used to compute stats for each id; method2 is used to compute other stats for each id, and so on.

The calls look like this:

```scala
val a = method1(trips, df, sc)
val b = method2(trips, df, sc)
val c = method3(trips, df, sc)
val d = method4(trips, df, sc)
val e = method5(trips, df, sc)
val f = method6(trips, df, sc)
```

Because each method takes a certain amount of time, is there any way to run these assignments at the same time? The type of a, b, ..., f is DataFrame.
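
For clarity, here is a sketch of the shapes assumed above (the concrete column and tuple types are guesses; the only stated facts are that trips comes from df and that each method returns a DataFrame):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.DataFrame

// df:    DataFrame with columns | id | time | speed | gps | ...
// trips: (id, start, end) records derived from df
// All six methods are assumed to share this signature:
def method1(trips: Seq[(String, Long, Long)], df: DataFrame, sc: SparkContext): DataFrame = ???
```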

steven
    Are the type signatures the same? – Reactormonk Jan 05 '16 at 06:30
  • yes, the type signatures are the same – steven Jan 05 '16 at 06:34
  • @zero323 spark runs operations in parallel anyway, because the calls are lazy - unless you collect somewhere. – Reactormonk Jan 05 '16 at 06:35
  • @zero323 Indeed, each method has to do a series of work (DataFrame filter, select, and compute after dataframe.collect()) – steven Jan 05 '16 at 06:37
  • @steven can't you do it without collect? – Reactormonk Jan 05 '16 at 06:44
  • It's hard to do without a collect action, because I have to compute a datetime diff on one column (time), and there is no method for that in the Spark API right now. The purpose of this is to divide the whole dataset into trips based on drive time. Is it possible to use akka.actor to achieve my goal? – steven Jan 05 '16 at 06:49
  • @steven you could write a UDAF. It's a bit hairy though. – Reactormonk Jan 05 '16 at 06:51
  • @steven Unless it is a matter of suboptimal resource utilization (small data frames, a lot of resources), it is not worth all the fuss. You can submit jobs asynchronously using futures or parallel maps over functions and force execution, but it is usually pointless. Based on your last comment it looks more like an XY problem anyway. – zero323 Jan 05 '16 at 06:51
  • @Reactormonk I asked about that problem several days ago. In the end, I used Scala's sliding method on the collect()ed data. http://stackoverflow.com/questions/33935363/how-to-compute-datetime-diff-for-one-col-in-spark-dataframe – steven Jan 05 '16 at 06:57
  • Why not use `sliding` on RDD? – zero323 Jan 05 '16 at 07:01
  • I guess you linked to a confusing question there... Take a look at this for example: http://stackoverflow.com/q/33598541/1560062 – zero323 Jan 05 '16 at 07:05
  • Also there is a new library from Cloudera: https://github.com/cloudera/spark-timeseries – zero323 Jan 05 '16 at 07:08
  • @Reactormonk Laziness doesn't really affect parallelism. How different tasks are executed is a matter of DAGs and resource management. – zero323 Jan 05 '16 at 07:10
  • @zero323 that is correct. But laziness gives Spark the ability to do parallelism for you. If you `collect` in between, you're taking that ability away from Spark. – Reactormonk Jan 05 '16 at 07:21
  • @zero323 thank you, and Reactormonk; as you said, "You can submit jobs asynchronously using futures or parallel maps over functions and force execution but it is usually pointless", can you give an example of a parallel map over functions for this? – steven Jan 05 '16 at 07:48
  • http://docs.scala-lang.org/overviews/parallel-collections/overview.html, http://stackoverflow.com/questions/31912858/processing-multiple-files-as-independent-rdds-in-parallel/31916657#31916657, or the solution provided by @SandeepPurohit (a sketch follows below) – zero323 Jan 05 '16 at 11:08
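
A minimal sketch of the "parallel map over functions" idea from the comments above, assuming the six methods share the signature sketched in the question (forcing each job with count() is just one way to materialize the results; it is unnecessary if the methods already call collect() internally):

```scala
// Treat the six methods as values and map over them with a parallel
// collection, so the driver submits the six Spark jobs concurrently.
val methods = Seq(method1 _, method2 _, method3 _, method4 _, method5 _, method6 _)

val Seq(a, b, c, d, e, f) = methods.par.map { m =>
  val result = m(trips, df, sc)
  result.count() // an action forces the job to run now
  result
}.seq
```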

1 Answer


Yes, you can run multiple jobs at the same time in a Spark cluster with the help of asynchronous actions like collectAsync(), countAsync(), etc.

You just set the scheduler mode in the context configuration: `.set("spark.scheduler.mode", "FAIR")`.

Then use the asynchronous actions: every job runs asynchronously and returns a Future, so your methods also return Futures, and all of them can run at the same time.
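
A minimal sketch of this approach, assuming the six methods from the question (this sketch wraps the calls in plain Scala Futures rather than calling collectAsync() directly, and the timeout is an arbitrary choice):

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import org.apache.spark.SparkConf
import org.apache.spark.sql.DataFrame

// FAIR scheduling must be set before the SparkContext is created
// (or passed via spark-submit: --conf spark.scheduler.mode=FAIR).
val conf = new SparkConf().set("spark.scheduler.mode", "FAIR")

// Wrap each call in a Future so all six jobs are submitted concurrently.
val futures: Seq[Future[DataFrame]] = Seq(
  Future { method1(trips, df, sc) },
  Future { method2(trips, df, sc) },
  Future { method3(trips, df, sc) },
  Future { method4(trips, df, sc) },
  Future { method5(trips, df, sc) },
  Future { method6(trips, df, sc) }
)

// Block only when the results are actually needed.
val Seq(a, b, c, d, e, f) = Await.result(Future.sequence(futures), 1.hour)
```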

Sandeep Purohit