
I'm using a Python Spark workflow that executes multiple tasks: it gets some data from a database, processes it, and loads it back into the database.

These are all separate tasks, but each of them needs the database connection object.

How do I pass the DB connection between multiple executors? Do I need to make a separate connection in each task, or can I distribute the connection?

user1050619
1 Answer


You shouldn't pass a db connection between multiple executors since they are going to run on different workers that could be on different machines.

It seems that some people manage to initialize one connection per JVM in Scala/Java -- see Spark-streaming-and-connection-pool-implementation
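
In PySpark, the usual workaround is to create the connection inside the task itself, for example with `foreachPartition`, so that each partition opens and closes its own connection on the executor and nothing unpicklable has to be shipped from the driver. Below is a minimal sketch of that pattern; the `psycopg2` driver, the connection parameters, and the `results` table are placeholders I made up, not anything from your code.

```python
# Minimal sketch: one connection per partition, created on the executor.
# psycopg2 and the connection/table details below are placeholders.
from pyspark.sql import SparkSession
import psycopg2  # any DB-API driver can be used the same way

spark = SparkSession.builder.appName("db-per-partition").getOrCreate()

def write_partition(rows):
    # Runs on the executor: the connection is created and closed here,
    # so it never crosses the driver/executor boundary.
    conn = psycopg2.connect(host="db-host", dbname="mydb",
                            user="user", password="secret")
    cur = conn.cursor()
    for row in rows:
        cur.execute("INSERT INTO results (id, value) VALUES (%s, %s)",
                    (row["id"], row["value"]))
    conn.commit()
    cur.close()
    conn.close()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.foreachPartition(write_partition)
```

Opening the connection per partition rather than per row keeps the number of connections bounded by the number of partitions instead of the number of records.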

Franzi
  • @Franzi: Thanks. Is there a PySpark implementation of this? I could not find one. – user1050619 Nov 01 '17 at 15:48
  • As pointed out here https://stackoverflow.com/a/38268367/1916298, PySpark uses separate processes, so each worker will have its own process. That means that sharing Python connections between workers is not possible. Even if you enable `spark.python.worker.reuse`, which allows Spark to reuse worker processes, you will still need a separate connection per worker. – Franzi Nov 01 '17 at 20:36
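
Building on that comment, here is a hedged sketch of what per-worker reuse could look like: a lazily created module-level connection that is shared by the tasks that land on the same Python worker process when `spark.python.worker.reuse` is in effect. For this to work, the code has to live in a module shipped to the executors (for example via `--py-files`) rather than in the driver script, so the module-level variable persists for the life of the worker process. The module name, `psycopg2`, the connection parameters, and `lookup_table` are all placeholders.

```python
# db_utils.py -- hypothetical module shipped to executors via --py-files.
# One connection per Python worker process, created lazily and reused
# across the tasks handled by that process when worker reuse is enabled.
import psycopg2  # placeholder driver; any DB-API driver works

_conn = None  # lives for as long as this worker process does

def get_connection():
    """Create the connection on first use in this worker process."""
    global _conn
    if _conn is None:
        _conn = psycopg2.connect(host="db-host", dbname="mydb",
                                 user="user", password="secret")
    return _conn

def lookup_partition(rows):
    # Runs on the executor; every partition handled by this worker
    # process shares the same _conn.
    cur = get_connection().cursor()
    for row in rows:
        cur.execute("SELECT value FROM lookup_table WHERE id = %s",
                    (row["id"],))
        fetched = cur.fetchone()
        yield (row["id"], fetched[0] if fetched else None)

# On the driver:
#   from db_utils import lookup_partition
#   result = df.rdd.mapPartitions(lookup_partition).collect()
```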