
Is it possible to do a map from a mapper function (i.e. from tasks) in PySpark? In other words, is it possible to open "sub tasks" from a task? If so, how do I pass the SparkContext to the tasks - just as a variable?

I would like to have a job that is composed of many tasks, and each of these tasks should create many tasks as well, without going back to the driver.
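
To make this concrete, here is a rough sketch of the pattern I am after (expand_task and run_subtask are just placeholder names; I do not know whether referencing sc inside a mapper like this can work, which is exactly my question):

    from pyspark import SparkContext

    sc = SparkContext(appName="nested-tasks-question")

    def run_subtask(subtask):
        # some per-subtask computation
        return subtask * 2

    def expand_task(task):
        # what I would like to do: spawn "sub tasks" from inside a task,
        # i.e. use sc inside a mapper instead of going back to the driver
        return sc.parallelize(range(task)).map(run_subtask).collect()

    result = sc.parallelize([3, 5, 7]).map(expand_task).collect()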

My use case is like this: I am porting an application that was written using work queues to PySpark. In my old application, tasks created other tasks, and we used this functionality. I don't want to redesign the whole code because of the move to Spark (especially because I will have to make sure that both platforms work during the transition phase between the systems)...

– Ofer Eliassaf

1 Answer


Is it possible to open "sub tasks" from a task?

No, at least not in a healthy manner*.

A task is a command sent from the driver, and Spark has one driver (central coordinator) that communicates with many distributed workers (executors).

As a result, what you ask for here implies that every task can play the role of a sub-driver. Not even a worker can do that; a worker would meet the same fate in my answer as the task does.
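
If your fan-out can be expressed as plain data, the usual way to get a similar effect is to keep the driver as the only job creator and do the expansion with a transformation such as flatMap. A minimal sketch (generate_subtasks and process_subtask are made-up placeholder names):

    from pyspark import SparkContext

    sc = SparkContext(appName="flatmap-instead-of-subtasks")

    def generate_subtasks(task):
        # runs on the executors, but returns plain data, not new Spark jobs
        return [(task, i) for i in range(task)]

    def process_subtask(subtask):
        task, i = subtask
        return task * i

    result = (sc.parallelize([3, 5, 7])
                .flatMap(generate_subtasks)  # each task expands into many records
                .map(process_subtask)        # every record becomes its own unit of work
                .collect())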

Related resources:

  1. What is a task in Spark? How does the Spark worker execute the jar file?
  2. What are workers, executors, cores in Spark Standalone cluster?

*With that said, I mean that I am not aware of any hack that would do this, and even if one exists, it would be too specific.

– gsamaras
  • Thanks for the help - I thought that would be the answer. I played a bit and I actually successfully created other applications (Spark contexts) from tasks, and things seemed to work fine on a small-scale cluster. I was afraid that this is a hack and the behavior will be undefined... You solved my dilemma. – Ofer Eliassaf Aug 21 '16 at 05:57
  • Yes - and it worked - I had some problems with resource allocation, since the main driver took all the CPU and the subtasks' drivers got starved - but it can be solved using special and complex configurations... I was afraid that this is too hacky and that the behavior is undefined. I also wanted to avoid the ugly configuration necessary - this is why I asked about sub tasks of the same application (my idea was to use fair scheduling to avoid the starvation). – Ofer Eliassaf Aug 21 '16 at 06:16
  • @OferE. I understand you; hmm, if I were you, I would give it a shot with more machines, *but* at the same time modify my code to work as a nice, healthy Spark job. I mean, I haven't heard of anybody doing what you are trying to do.. :) – gsamaras Aug 21 '16 at 06:19
  • Since the driver is also thread-safe, I thought that the tasks could open a communication channel to it, submit the work to the driver, and use fair scheduling - but this is just too much... I am doomed :-) – Ofer Eliassaf Aug 21 '16 at 06:22
  • Exactly what I was thinking @OferE., I mean I haven't heard anybody talking about this, so I am not 100% sure that it cannot be done. However, if you wish to go for it, then you will probably have a hard time. The approach in your last comment seems reasonable, but yes, this might be too much.. Good luck! ;) – gsamaras Aug 21 '16 at 06:27