
I have a large dataset of 5 million items, each with an ID, cost, etc. I have been using sqlContext in the PySpark shell to load the JSON, create a DataFrame, and finally apply all the required operations on that DataFrame.

I'm new to Spark and have a question: whenever I perform an operation on my DataFrame, whether a built-in function (e.g. loading the JSON with sqlContext.read.json(filePath)) or a UDF, is it automatically multithreaded, or do I need to specify something explicitly to make it multithreaded? If it is multithreaded, how can I view and change the number of threads currently being used?


1 Answer


There is no multithreading involved (nor would it be useful here), but execution is parallel: each partition of the data is processed as a separate task by separate worker processes.
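For instance, a minimal PySpark-shell sketch of what this means in practice (the file path is hypothetical): after a read, the DataFrame is already split into partitions, and the partition count is the upper bound on how many tasks can run at once.

    # Minimal sketch in the PySpark shell; the path is hypothetical.
    df = sqlContext.read.json("/data/items.json")

    # Each partition is processed as an independent task, so the
    # partition count bounds the parallelism of later operations.
    print(df.rdd.getNumPartitions())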

To control parallelism:

  • Adjust the number of worker cores.
  • Adjust the number of DataFrame partitions (on read or with repartition); see the sketch below this list.
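A hedged sketch of both knobs, assuming a local PySpark session (the master string, application name, path, and partition count are illustrative, not taken from the question):

    from pyspark.sql import SparkSession

    # local[4] gives this application 4 worker cores; adjust as needed.
    spark = (SparkSession.builder
             .master("local[4]")
             .appName("parallelism-demo")      # hypothetical app name
             .getOrCreate())

    df = spark.read.json("/data/items.json")   # hypothetical path
    print(df.rdd.getNumPartitions())           # partition count chosen on read

    # Explicitly change the partition count to tune parallelism.
    df = df.repartition(16)
    print(df.rdd.getNumPartitions())           # now 16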
Yeah, actually I meant parallel processing rather than multithreading. So whenever I load a dataset using the command above, does Spark automatically split the DataFrame for me and execute on it in parallel? – Aman Jun 12 '18 at 07:12