
How to change the number of parallel tasks in PySpark?

I mean: how do I change the number of parallel map tasks that run on my PC? I actually want to plot a speed-up chart against the number of map tasks.

sample code:

words = sc.parallelize(["scala","java","hadoop"])\
           .map(lambda word: (word, 1)) \
           .reduceByKey(lambda a, b: a + b)

If you understand my purpose but I asked it the wrong way, I would appreciate it if you corrected my question.

Thanks

  • On your PC (local execution) or in a cluster? Arguably the former is of no particular interest... – desertnaut Nov 18 '17 at 18:20
  • yes, I mean local execution – Captain Nov 18 '17 at 19:02
  • 1
    There is not much meaning in this; in general, if you are going to work on a single machine, you have absolutely no reason to use Spark (beyond toy examples for demonstration purposes, that is, where questions like yours are of no practical use). – desertnaut Nov 18 '17 at 21:23

1 Answer


For this toy example, the number of parallel tasks will depend on the following (a configuration sketch follows the list):

  • Number of partitions for the input RDD - set by spark.default.parallelism if not configured otherwise.
  • Number of threads assigned to the local master (might be superseded by the above).
  • Physical and permission-based capabilities of the system.
  • Statistical properties of the dataset.
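
To illustrate, here is a minimal sketch of how these knobs can be set for local execution; the thread count (4), the spark.default.parallelism value (8), the app name, and the numSlices argument below are arbitrary values chosen for the example:

from pyspark import SparkConf, SparkContext

# Run locally with 4 worker threads ("local[*]" would use one per core)
# and set the default partition count for RDDs that don't specify one.
conf = (SparkConf()
        .setMaster("local[4]")
        .setAppName("parallelism-demo")
        .set("spark.default.parallelism", "8"))
sc = SparkContext(conf=conf)

# The partition count can also be set per RDD via the numSlices argument.
words = sc.parallelize(["scala", "java", "hadoop"], numSlices=4) \
          .map(lambda word: (word, 1)) \
          .reduceByKey(lambda a, b: a + b)

print(words.getNumPartitions())  # upper bound on tasks that can run in parallel
print(words.collect())

Varying the N in local[N] (or the numSlices value) between runs is what you would change to plot a speed-up chart.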

However, Spark is not a lightweight parallelization tool - for that we have low-overhead alternatives like threading and multiprocessing, higher-level components built on top of these (like joblib or RxPy), and native extensions (to escape the GIL with threading).
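
For comparison, here is a purely local sketch of the same word count using the standard multiprocessing module (no Spark involved; the pool size of 4 and the chunking scheme are arbitrary choices for illustration):

from collections import Counter
from multiprocessing import Pool

def count_words(chunk):
    # Per-chunk word count (the "map" step).
    return Counter(chunk)

if __name__ == "__main__":
    data = ["scala", "java", "hadoop"] * 1000
    chunks = [data[i::4] for i in range(4)]   # split the input into 4 chunks

    with Pool(processes=4) as pool:           # 4 worker processes
        partial_counts = pool.map(count_words, chunks)

    # Merge the per-chunk counts (the "reduce" step).
    print(sum(partial_counts, Counter()))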

Spark itself is heavyweight, with huge coordination and communication overhead, and as stated by desertnaut it is hardly justified for anything other than testing when limited to a single node. In fact, it can make things much worse with higher parallelism.

  • Nice points (+1); from the last link: "**Spark is not focused on parallel computing**. Parallel processing is more a side effect of the particular solution than the main goal. Spark is distributed first, parallel second. The main point is to keep processing time constant with increasing amount of data by scaling out, not speeding up existing computations." – desertnaut Nov 19 '17 at 18:35
  • I kindly suggest that you edit your answer to include the above quote... – desertnaut Nov 19 '17 at 19:48