
I have a Spark job which spins out over 3000 tasks. If each task creates its own database connection, that's a lot of connections, and they won't be able to share prepared statements. Does anyone know of a way to share connections across Spark tasks, let's say just inside each worker node?

Note that this is different from sharing variables using broadcast. A connection created on the master cannot be shipped to workers and still work.

bhomass

1 Answer


TL;DR Probably not.

Python, R:

Workers use separate processes, so sharing a connection between tasks is not possible.

Java, Scala:

Technically speaking, yes. You can define a singleton connection (for example with an object or a transient lazy val), but quoting skaffman:

you should avoid sharing connections between threads, since the activity on the connection will mean that only one thread will be able to do anything at a time.

The consequence? An underutilized cluster, with the database-dependent stage as the bottleneck.
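
For illustration, here is a minimal sketch of such a singleton, assuming a plain JDBC connection with a placeholder URL and credentials. Since a Scala object is initialized once per JVM, every task running on the same executor reuses the same connection, and runs into exactly the contention described above:

```scala
import java.sql.{Connection, DriverManager}

// One lazily created connection per executor JVM.
// URL and credentials are placeholders.
object ConnectionHolder {
  lazy val connection: Connection =
    DriverManager.getConnection("jdbc:postgresql://db-host:5432/mydb", "user", "secret")
}

// Hypothetical usage: all tasks of a partition-level operation on the same
// executor funnel through the single shared connection.
// rdd.foreachPartition { iter =>
//   val stmt = ConnectionHolder.connection
//     .prepareStatement("INSERT INTO t (value) VALUES (?)")
//   iter.foreach { v => stmt.setString(1, v); stmt.executeUpdate() }
// }
```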

What about a connection pool? Once again, technically speaking it is possible, in the same way as on a single machine. But to keep the expected cluster utilization you need roughly one connection per core, so there is little to gain.
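
For completeness, a sketch of what a per-executor pool could look like, using Apache Commons DBCP2 purely as an example (any pooling library works the same way). The URL, credentials, and pool size are placeholders:

```scala
import java.sql.Connection
import org.apache.commons.dbcp2.BasicDataSource

// One pool per executor JVM, shared by all tasks running there.
// URL, credentials, and pool size are placeholders.
object PoolHolder {
  lazy val dataSource: BasicDataSource = {
    val ds = new BasicDataSource()
    ds.setUrl("jdbc:postgresql://db-host:5432/mydb")
    ds.setUsername("user")
    ds.setPassword("secret")
    // To keep every core busy you still need roughly one connection per
    // concurrently running task, which is why pooling gains little here.
    ds.setMaxTotal(8)
    ds
  }
}

// Hypothetical usage: borrow a connection per partition, return it when done.
// rdd.foreachPartition { iter =>
//   val conn: Connection = PoolHolder.dataSource.getConnection
//   try iter.foreach(writeRow(conn, _))   // writeRow is a hypothetical helper
//   finally conn.close()                  // returns the connection to the pool
// }
```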

Alper t. Turker