Because of Python's GIL we cannot use threading to parallelize CPU-bound work, so my question is: how does Apache Spark utilize Python in a multi-core environment?
1 Answer
Python's multi-threading issues are separate from Apache Spark's internals: parallelism in Spark is handled inside the JVM.
And the reason is that in the Python driver program, SparkContext
uses Py4J to launch a JVM and create a JavaSparkContext.
Py4J is only used on the driver for local communication between the Python and Java SparkContext objects; large data transfers are performed through a different mechanism.
RDD transformations in Python are mapped to transformations on PythonRDD objects in Java. On remote worker machines, PythonRDD objects launch Python sub-processes and communicate with them using pipes, sending the user's code and the data to be processed.
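A minimal sketch of that pipe mechanism (illustrative only — the names `WORKER` and `run_on_worker` are mine, and the real PySpark protocol is more involved, with reusable worker daemons and batched, cloudpickle-based serialization): the parent process plays the role of the JVM-side PythonRDD, launching a fresh Python worker and exchanging pickled code and data with it over stdin/stdout pipes.

```python
import operator
import pickle
import subprocess
import sys

# Child-side script: read a pickled (function, data) pair from the stdin pipe,
# apply the function to each element, and pickle the results back over stdout.
WORKER = (
    "import pickle, sys\n"
    "func, data = pickle.load(sys.stdin.buffer)\n"
    "pickle.dump([func(x) for x in data], sys.stdout.buffer)\n"
)

def run_on_worker(func, data):
    # Launch a fresh Python worker process and communicate with it through
    # its stdin/stdout pipes, like the PythonRDD <-> worker pipe in the answer.
    proc = subprocess.run(
        [sys.executable, "-c", WORKER],
        input=pickle.dumps((func, data)),
        capture_output=True,
        check=True,
    )
    return pickle.loads(proc.stdout)

# Plain pickle serializes functions by reference, so we use one the worker can
# import (PySpark ships arbitrary user closures with cloudpickle instead).
print(run_on_worker(operator.neg, [1, 2, 3]))  # → [-1, -2, -3]
```

Because each partition of data is handed to its own worker process like this, several partitions can be processed in parallel on one machine without any single Python interpreter's GIL getting in the way.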
PS: I'm not sure if this actually answers your question completely.

eliasah
- I think the main point here is that PySpark doesn't use multi-threading, so the GIL is simply not an issue. – zero323 Jun 26 '16 at 14:03
- @zero323 can you elaborate on your comment? – Vahid Hashemi Jun 26 '16 at 17:58
- There is not much to elaborate. Excluding tests, there are only a few places where PySpark uses threads to perform secondary tasks, like starting an external process. Everything else is just good old single-threaded processing. – zero323 Jun 26 '16 at 18:07
- I concur with @zero323; that's why I said all the parallel processing is dealt with inside the JVM. – eliasah Jun 26 '16 at 18:46
- @eliasah To be fair, the JVM part is not that heavy on multithreading either, don't you think? There are multiple threads required for housekeeping, and JVM executors use threads, but in practice multithreading is not really required to achieve parallelism in Spark. One could start an equivalent number of workers on each machine and get the same parallelism, although at a higher price. – zero323 Jun 26 '16 at 19:03
- I never said that multithreading is needed to achieve parallelism. I'm not a big fan of that approach either. – eliasah Jun 26 '16 at 19:26
- Spawning a new thread inside a JVM for each task is much more cost-effective and takes less time than firing up a new JVM. – ravi malhotra Oct 07 '19 at 09:27
- Thanks for your comment @ravimalhotra, but I'm not sure how it relates to my answer... – eliasah Oct 07 '19 at 09:30
- @eliasah I was just going through zero323's comment: "One could start equivalent number of workers on each machine and get the same parallelism although at a higher price". In response, I just wanted to point out why the JVM uses multithreading. – ravi malhotra Oct 07 '19 at 09:35
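The GIL behavior debated in these comments can be sketched as follows (a minimal illustration, not Spark code; the function names are mine): threads compute correct results but cannot run CPU-bound Python bytecode in parallel, while separate worker processes — each with its own interpreter and its own GIL — can.

```python
import multiprocessing
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def cpu_bound(n):
    # A tight pure-Python loop holds the GIL for as long as it runs.
    total = 0
    for i in range(n):
        total += i * i
    return total

def run_threads(tasks, workers=4):
    # Threads: correct results, but the GIL serializes bytecode execution,
    # so CPU-bound tasks gain no parallel speedup.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(cpu_bound, tasks))

def run_processes(tasks, workers=4):
    # Processes: each worker is a separate interpreter with its own GIL.
    # "fork" start method assumed (POSIX); on Windows use "spawn" and guard
    # the calling code with `if __name__ == "__main__":`.
    ctx = multiprocessing.get_context("fork")
    with ProcessPoolExecutor(max_workers=workers, mp_context=ctx) as pool:
        return list(pool.map(cpu_bound, tasks))
```

Both variants return identical results; only the process-based one actually uses multiple cores, which is why PySpark relies on per-partition worker processes rather than Python threads.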