
df is a DataFrame containing all the car data (| id | time | speed | gps | ... |);

trips is a list of (id, start, end) records generated from df.

method1 is used to compute stats for each id; method2 is used to compute other stats for each id, and so on.

The calls look like this:

```scala
val a = method1(trips, df, sc)
val b = method2(trips, df, sc)
val c = method3(trips, df, sc)
val d = method4(trips, df, sc)
val e = method5(trips, df, sc)
val f = method6(trips, df, sc)
```

Because each method takes a certain amount of time, is there any way to run these assignments at the same time? The type of a, b, ..., f is DataFrame.
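
For clarity, here is a sketch of the shapes assumed above (the concrete column and tuple types are guesses; the only stated facts are that trips comes from df and that each method returns a DataFrame):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.DataFrame

// df:    DataFrame with columns | id | time | speed | gps | ...
// trips: (id, start, end) records derived from df
// All six methods are assumed to share this signature:
def method1(trips: Seq[(String, Long, Long)], df: DataFrame, sc: SparkContext): DataFrame = ???
```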

steven
    Are the type signatures the same? – Reactormonk Jan 05 '16 at 06:30
  • yes, the type signatures are the same – steven Jan 05 '16 at 06:34
  • @zero323 spark runs operations in parallel anyway, because the calls are lazy - unless you collect somewhere. – Reactormonk Jan 05 '16 at 06:35
  • @zero323 Indeed, each method has to do a series of work (DataFrame filter, select, and compute after dataframe.collect()) – steven Jan 05 '16 at 06:37
  • @steven can't you do it without collect? – Reactormonk Jan 05 '16 at 06:44
  • It's hard to do without a collect action, because I have to compute a datetime diff on one column (time), and there is no method for that in the Spark API right now. The purpose of this is to divide the whole dataset into trips based on drive time. Is it possible to use akka.actor to achieve my goal? – steven Jan 05 '16 at 06:49
  • @steven you could write a UDAF. It's a bit hairy though. – Reactormonk Jan 05 '16 at 06:51
  • @steven Unless it is a matter of suboptimal resource utilization (small data frames, a lot of resources), it is not worth all the fuss. You can submit jobs asynchronously using futures or parallel maps over functions and force execution, but it is usually pointless. Based on your last comment it looks more like an XY problem anyway. – zero323 Jan 05 '16 at 06:51
  • @Reactormonk I asked about that problem several days ago. In the end, I used Scala's sliding method on the collect()ed data. http://stackoverflow.com/questions/33935363/how-to-compute-datetime-diff-for-one-col-in-spark-dataframe – steven Jan 05 '16 at 06:57
  • Why not use `sliding` on RDD? – zero323 Jan 05 '16 at 07:01
  • I guess you linked to a confusing question there... Take a look at this for example: http://stackoverflow.com/q/33598541/1560062 – zero323 Jan 05 '16 at 07:05
  • Also there is a new library from Cloudera: https://github.com/cloudera/spark-timeseries – zero323 Jan 05 '16 at 07:08
  • @Reactormonk Laziness doesn't really affect parallelism. How different tasks are executed is a matter of DAGs and resource management. – zero323 Jan 05 '16 at 07:10
  • @zero323 that is correct. But laziness gives Spark the ability to do parallelism for you. If you `collect` in between, you're taking that ability away from Spark. – Reactormonk Jan 05 '16 at 07:21
  • @zero323 thank you, and Reactormonk; as you said, "You can submit jobs asynchronously using futures or parallel maps over functions and force execution but it is usually pointless", can you give an example of a parallel map over functions for this? – steven Jan 05 '16 at 07:48
  • http://docs.scala-lang.org/overviews/parallel-collections/overview.html, http://stackoverflow.com/questions/31912858/processing-multiple-files-as-independent-rdds-in-parallel/31916657#31916657, or the solution provided by @SandeepPurohit (a sketch follows below) – zero323 Jan 05 '16 at 11:08
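
A minimal sketch of the "parallel map over functions" idea from the comments above, assuming the six methods share the signature sketched in the question (forcing each job with count() is just one way to materialize the results; it is unnecessary if the methods already call collect() internally):

```scala
// Treat the six methods as values and map over them with a parallel
// collection, so the driver submits the six Spark jobs concurrently.
val methods = Seq(method1 _, method2 _, method3 _, method4 _, method5 _, method6 _)

val Seq(a, b, c, d, e, f) = methods.par.map { m =>
  val result = m(trips, df, sc)
  result.count() // an action forces the job to run now
  result
}.seq
```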

1 Answer


Yes, you can run multiple jobs at the same time in a Spark cluster with the help of asynchronous actions like collectAsync(), countAsync(), etc.

You just set the scheduler mode in the context configuration: `.set("spark.scheduler.mode", "FAIR")`.

Then use the asynchronous actions: every job runs asynchronously and returns a Future, so your methods also return Futures, and all of them can run at the same time.
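
A minimal sketch of this approach, assuming the six methods from the question (this sketch wraps the calls in plain Scala Futures rather than calling collectAsync() directly, and the timeout is an arbitrary choice):

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import org.apache.spark.SparkConf
import org.apache.spark.sql.DataFrame

// FAIR scheduling must be set before the SparkContext is created
// (or passed via spark-submit: --conf spark.scheduler.mode=FAIR).
val conf = new SparkConf().set("spark.scheduler.mode", "FAIR")

// Wrap each call in a Future so all six jobs are submitted concurrently.
val futures: Seq[Future[DataFrame]] = Seq(
  Future { method1(trips, df, sc) },
  Future { method2(trips, df, sc) },
  Future { method3(trips, df, sc) },
  Future { method4(trips, df, sc) },
  Future { method5(trips, df, sc) },
  Future { method6(trips, df, sc) }
)

// Block only when the results are actually needed.
val Seq(a, b, c, d, e, f) = Await.result(Future.sequence(futures), 1.hour)
```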

Sandeep Purohit