
How do volatile variables work when using multithreading inside of Spark?

I have a multithreaded process that uses a volatile total variable to keep track of a sum across multiple threads. This variable and all the methods being executed are static. I am curious how this variable would behave if I had multiple Spark workers executing separate instances of this process in parallel.

Will each of them have their own total variable or will it be shared across worker nodes?
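
For reference, here is a minimal sketch of the kind of setup I mean (the class and method names, and the `simulate()` stand-in, are made up for illustration, not my actual code):

```java
public class FitnessWorker {

    // Shared by all threads in ONE JVM. volatile only guarantees visibility;
    // the compound update total += x still needs synchronization.
    private static volatile double total = 0.0;

    private static synchronized void add(double value) {
        total += value;
    }

    public static double run(int iterations) throws InterruptedException {
        total = 0.0;
        Thread[] threads = new Thread[iterations];
        for (int i = 0; i < iterations; i++) {
            final double param = i * 0.01;  // small parameter change per iteration
            threads[i] = new Thread(() -> add(simulate(param)));
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        return total;
    }

    // Stand-in for the real per-iteration computation.
    private static double simulate(double param) {
        return Math.sqrt(param);
    }
}
```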

EDIT: The reason I want to multithread and use Spark is that my program is a Genetic Algorithm that works as follows: distribute n populations to Spark, ideally 1 population per worker. Each population has 10-100 "individuals." For each individual, calculate its fitness by running the multithreaded process 100 times (each iteration has a small parameter change) and return a function of the total across the iterations.

The multithreaded process takes a long time, so I would like to speed it up in any way possible.
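
Roughly, the Spark side would look something like the sketch below (the `Population` class and `evaluate()` are hypothetical placeholders; only the `SparkConf`/`JavaSparkContext`/`parallelize`/`map`/`collect` calls are the actual Spark Java API):

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class GaDriver {

    // Hypothetical stand-in for one GA population of 10-100 individuals.
    static class Population implements Serializable {
        final int id;
        Population(int id) { this.id = id; }

        double evaluate() {
            // For each individual this would call the multithreaded fitness
            // process ~100 times (e.g. FitnessWorker.run above) and combine
            // the totals; a trivial value is returned here to keep it runnable.
            return id;
        }
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("ga-fitness");
        JavaSparkContext sc = new JavaSparkContext(conf);

        int nPopulations = 8;
        List<Population> populations = new ArrayList<>();
        for (int i = 0; i < nPopulations; i++) {
            populations.add(new Population(i));
        }

        // One partition per population, so ideally one population per worker.
        List<Double> fitnesses = sc
                .parallelize(populations, nPopulations)
                .map(Population::evaluate)
                .collect();

        System.out.println(fitnesses);
        sc.stop();
    }
}
```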

    This sounds very much like an A/B question. What is the top-level goal you are trying to achieve ? Spark should abstract you away from this kind of considerations – Dici May 10 '16 at 01:08
  • @Dici I have a large process that I am running a lot of times. A portion of this process has been multithreaded to speed it up. I also have a way of having Spark running separate instances of this process on each worker node. I am curious as to how Spark will handle the already multithreaded code. – jbird May 11 '16 at 17:45
  • Variables are local to a process. If you have multiple processes, you have multiple unrelated instances (not copies) of the variable. Spark doesn't change that. – vanza May 11 '16 at 22:45

1 Answer


Alright, I think I figured this out by combining the comment by @vanza and the answer here.

Essentially, each worker node will have its own instance of the class performing the multithreaded process, so there is no chance they will overlap. This is actually pretty intuitive, since if my worker nodes are on different machines, they won't be sharing variables between themselves.
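
As a quick illustration of why (the names here are hypothetical and this is only a sketch): a static field lives once per JVM, so a task that updates it on an executor only touches that executor's copy, never the driver's copy or another worker's.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class StaticStateDemo {

    // One instance of this field per JVM (driver, each executor), not cluster-wide.
    static volatile long counter = 0;

    public static void main(String[] args) {
        JavaSparkContext sc =
                new JavaSparkContext(new SparkConf().setAppName("static-state-demo"));

        // Each task increments the copy of `counter` living in its executor's JVM.
        // (counter += x is not atomic, but that doesn't matter for this illustration.)
        sc.parallelize(Arrays.asList(1, 2, 3, 4), 4)
          .foreach(x -> counter += x);

        // On a real cluster the driver's copy is untouched, so this prints 0.
        // (In local mode the driver and "executor" share a JVM, so you would see 10.)
        System.out.println("driver counter = " + counter);
        sc.stop();
    }
}
```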

  • I still think you should let Spark parallelize your work. It probably depends on your exact use case, but as a rule of thumb I would say that if I ask Spark to run x tasks on y cores with `x < y`, then Spark should always be able to use y cores, potentially at their limit. In other words, the number of cores should define the level of parallelism and a partition (task) should be the smallest work unit – Dici May 12 '16 at 23:20
  • Yes that is a really good point and I do believe that it is the best practice in the general case. However, in a Genetic Algorithm (my specific use case), the most optimal form of parallelization would be to give a population to each worker node and then do work there (see [here](http://www.genetic-programming.com/parallel.html)). In this case however, my fitness function utilizes multiple threads to significantly speed up its process, so both levels of parallelization would be necessary to achieve the best performance. – jbird May 17 '16 at 15:55
  • Genetic algorithm on Spark :p sounds fun – Dici May 17 '16 at 16:59