
I have a large dataset of 5 million items, each with an ID, cost, etc. I have been using sqlContext in the PySpark shell to load the JSON, create a DataFrame, and finally apply all the required operations on that DataFrame.

I'm new to Spark and have a question: whenever I perform an operation on my DataFrame, whether a built-in function (e.g. loading the JSON with sqlContext.read.json(filePath)) or a UDF, is it automatically multithreaded, or do I need to specify something explicitly to make it multithreaded? If it is multithreaded, how can I view and change the number of threads currently being used?


1 Answer


There is no multithreading involved (nor would it be useful here), but execution is parallel: each partition of the data is processed as a separate task by separate worker processes.
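For instance, a minimal PySpark-shell sketch of what this means in practice (the file path is hypothetical): after a read, the DataFrame is already split into partitions, and the partition count is the upper bound on how many tasks can run at once.

    # Minimal sketch in the PySpark shell; the path is hypothetical.
    df = sqlContext.read.json("/data/items.json")

    # Each partition is processed as an independent task, so the
    # partition count bounds the parallelism of later operations.
    print(df.rdd.getNumPartitions())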

To control parallelism:

  • Adjust the number of worker cores.
  • Adjust the number of DataFrame partitions (on read or with repartition); see the sketch below this list.
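A hedged sketch of both knobs, assuming a local PySpark session (the master string, application name, path, and partition count are illustrative, not taken from the question):

    from pyspark.sql import SparkSession

    # local[4] gives this application 4 worker cores; adjust as needed.
    spark = (SparkSession.builder
             .master("local[4]")
             .appName("parallelism-demo")      # hypothetical app name
             .getOrCreate())

    df = spark.read.json("/data/items.json")   # hypothetical path
    print(df.rdd.getNumPartitions())           # partition count chosen on read

    # Explicitly change the partition count to tune parallelism.
    df = df.repartition(16)
    print(df.rdd.getNumPartitions())           # now 16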
Yeah, actually I meant parallel processing rather than multithreading. So whenever I load a dataset using the command above, does Spark automatically split the DataFrame for me and execute on it in parallel? – Aman Jun 12 '18 at 07:12