Having Spark process partitions concurrently, using a single dev/test machine

Question

I'm naively testing for concurrency in local mode, with the following Spark session:

SparkSession
      .builder
      .appName("local-mode-spark")
      .master("local[*]")
      .config("spark.executor.instances", 4)
      .config("spark.executor.cores", 2)
      .config("spark.network.timeout", "10000001") // to avoid shutdown during debug, avoid otherwise
      .config("spark.executor.heartbeatInterval", "10000000") // to avoid shutdown during debug, avoid otherwise
      .getOrCreate()

and a mapPartitions API call like follows:

import spark.implicits._ 

val inputDF : DataFrame = spark.read.parquet(inputFile)

val resultDF : DataFrame =
    inputDF.as[T].mapPartitions(sparkIterator => new MyIterator).toDF

This did surface one concurrency bug in my own code in MyIterator (not a bug in Spark's code). However, I'd like my application to use all available machine resources, both in production and during this testing, so that the chances of spotting additional concurrency bugs improve.

That is clearly not the case for me so far: my machine is only at very low CPU utilization throughout the heavy processing of the inputDF, while there's plenty of free RAM and the JVM Xmx poses no real limitation.

How would you recommend testing for concurrency on a local machine? The objective is to verify that, in production, Spark will not hit thread-safety or other concurrency issues in the code it runs from within MyIterator.

Can Spark, even in local mode, process separate partitions of my input dataframe in parallel? Can I get Spark to work concurrently on the same dataframe on a single machine, preferably in local mode?

My preference for local mode is that, as far as I can tell, it is the only mode where a main method can use Spark without depending on any Spark installation, relying on Spark only as a library dependency — matanster, Aug 27 '19 at 14:36
Could you please specify the size of your data on your local machine, i.e. inputFile? — maogautam, Aug 27 '19 at 20:14
@Prateek Is the number of partitions a property preserved in a dataframe written to Parquet? I am reading a dataframe from Parquet, 100,000+ records. How does the size come into play here? — matanster, Aug 28 '19 at 12:39
See https://stackoverflow.com/questions/44590284/number-of-executors-in-spark-local-mode — thebluephantom, Aug 28 '19 at 13:55
See https://stackoverflow.com/questions/29011574/how-does-spark-partitioning-work-on-files-in-hdfs — maogautam, Aug 28 '19 at 21:53

score 6 · Answer 1 · answered Aug 28 '19 at 13:38

Max parallelism

You are already running Spark in local mode using .master("local[*]").

local[*] uses as many threads as the number of processors available to the Java virtual machine (it uses Runtime.getRuntime.availableProcessors() to know the number).
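One way to check this on your own machine is to measure how many tasks actually overlap in time. The sketch below is an illustration, not part of the original answer: the object name LocalConcurrencyProbe, the helper peakConcurrency and the 200 ms sleep are all assumptions. It relies on System.nanoTime being comparable across tasks, which holds only because all tasks in local mode run in the same JVM.

```scala
import org.apache.spark.sql.SparkSession

object LocalConcurrencyProbe {
  // Runs one short task per partition, records each task's start/end time,
  // and returns the largest number of tasks that were running at the same instant.
  def peakConcurrency(spark: SparkSession, numPartitions: Int): Int = {
    val intervals = spark.sparkContext
      .parallelize(1 to numPartitions, numPartitions)
      .mapPartitions { _ =>
        val start = System.nanoTime()
        Thread.sleep(200) // assumption: long enough for tasks to overlap
        Iterator.single((start, System.nanoTime()))
      }
      .collect()

    // Sweep over the timeline: +1 at every start, -1 at every end.
    // The running maximum is the peak number of overlapping tasks.
    val events = intervals.flatMap { case (s, e) => Seq((s, 1), (e, -1)) }.sorted
    events.scanLeft(0)((running, event) => running + event._2).max
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("local-concurrency-probe")
      .master("local[*]")
      .getOrCreate()

    println(s"JVM processors:      ${Runtime.getRuntime.availableProcessors()}")
    println(s"default parallelism: ${spark.sparkContext.defaultParallelism}")
    println(s"peak overlapping tasks: ${peakConcurrency(spark, numPartitions = 16)}")
    spark.stop()
  }
}
```

If the peak stays at 1 on a multi-core machine, partitions are being processed one after another and the cause is worth investigating before relying on local mode to expose concurrency bugs.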

Max memory available to all executors/threads

I see that you are not setting the driver memory explicitly. By default the driver memory is 1g. If your local machine can spare more than this, set it explicitly. You can do that by either:

setting it in the properties file (default is spark-defaults.conf),
```
spark.driver.memory              5g
```
or by supplying the configuration setting at runtime:
```
$ ./bin/spark-shell --driver-memory 5g
```

Note that this cannot be achieved by setting it in the application: by then it is too late, because the JVM process has already started with a fixed amount of memory.

Nature of Job

Check the number of partitions in your dataframe. That essentially determines the maximum parallelism you can get.

inputDF.rdd.partitions.size

If the output of this is 1, your dataframe has only one partition, so you won't get any concurrency when you run operations on it. In that case, you may have to tweak some configuration, or repartition the data, to create more partitions so that tasks can run concurrently.
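A minimal sketch of that check, with a repartition when there are too few partitions. The object RepartitionSketch and the helper ensurePartitions are hypothetical names, and spark.range(...).coalesce(1) is only a stand-in for a Parquet input that arrives as a single partition. Keep in mind that repartition triggers a shuffle, so it is not free.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object RepartitionSketch {
  // Leaves the DataFrame alone if it already has at least `target` partitions,
  // otherwise redistributes its rows across `target` partitions (this shuffles).
  def ensurePartitions(df: DataFrame, target: Int): DataFrame =
    if (df.rdd.getNumPartitions >= target) df else df.repartition(target)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("repartition-sketch")
      .master("local[*]")
      .getOrCreate()

    // Stand-in for spark.read.parquet(inputFile), forced into a single partition.
    val inputDF = spark.range(0, 100000).toDF("id").coalesce(1)
    println(s"partitions before: ${inputDF.rdd.getNumPartitions}")

    // Aim for at least one partition per available task thread.
    val target = spark.sparkContext.defaultParallelism
    val widened = ensurePartitions(inputDF, target)
    println(s"partitions after:  ${widened.rdd.getNumPartitions}")

    spark.stop()
  }
}
```

Applied to the question's code, the mapPartitions call would then run on the repartitioned dataframe instead of on inputDF directly.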

DaRkMaN · Answer 2 · 2019-09-01T04:08:57.780

Running in local mode cannot simulate a production environment, for the following reasons.

A lot of code that would normally run under any other cluster manager is bypassed in local mode. Among the resulting issues, a few that I can think of:
a. Bugs in the way shuffle is handled cannot be detected. (Shuffle data is handled in a completely different way in local mode.)
b. Serialization-related issues will not be detected, since all code is available to the driver and tasks run in the driver itself, so they do not surface.
c. No speculative tasks (especially relevant for write operations).
d. Networking-related issues cannot be detected, since all tasks are executed in the same JVM. This includes issues in driver/executor communication and codegen-related issues.

Concurrency in local mode
a. The maximum concurrency that can be attained equals the number of cores on your local machine. (Link to code)
b. The job, stage and task metrics shown in the Spark UI are not accurate, since tasks incur the overhead of running in the same JVM as the driver.
c. As for CPU/memory utilization, it depends on the operation being performed. Is the operation CPU- or memory-intensive?

When to use local mode
a. Testing code that will run only on the driver
b. Basic sanity testing of the code that will be executed on the executors
c. Unit testing

tl;dr: The concurrency bugs that occur in local mode might not even be present under other cluster resource managers, since Spark has a lot of special handling for local mode (a lot of code checks isLocal and then follows a different code path altogether).

Wow. Thank you for the great delineation. I guess it's all about (optional step 1) stress testing my components for concurrency without Spark at all and (step 2) simply testing on a real "staging" Spark cluster. Unless you do it very differently for testing a single Spark job involving lots of moving parts? — matanster, Sep 01 '19 at 03:51
@matanster Right. 1. Testing in local mode is about as good as testing in a standalone Java application, where you would have better control and could simulate scenarios more easily with the Java thread APIs. 2. Testing on a staging cluster is better, since concurrency depends to an extent on the way the underlying resource manager (YARN, Mesos, K8s, standalone etc.) handles task execution. This avoids chasing issues that occur in local mode but would not occur in the real environment, and it also gives more confidence because the code has already been tested on the actual environment. — DaRkMaN, Sep 01 '19 at 12:26
(contd.) My recommendation would be to test on a staging cluster (maybe with minimal resources). — DaRkMaN, Sep 01 '19 at 12:26
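The first step discussed in these comments, stress testing a component for concurrency without Spark at all, could look roughly like the sketch below. Everything in it is an assumption made for illustration: CountingIterator is a hypothetical stand-in for MyIterator (whose code is not shown in the question), and the shared AtomicLong stands for whatever state the real iterator instances share. The idea is just to drain several iterator instances at once on plain Java threads and check an invariant afterwards.

```scala
import java.util.concurrent.{Callable, CountDownLatch, Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicLong

object IteratorStressTest {
  // Hypothetical stand-in for MyIterator: wraps one partition's records and
  // touches shared state, which is where thread-safety bugs tend to hide.
  final class CountingIterator(underlying: Iterator[Int], shared: AtomicLong)
      extends Iterator[Int] {
    def hasNext: Boolean = underlying.hasNext
    def next(): Int = { shared.incrementAndGet(); underlying.next() * 2 }
  }

  // Drains `partitions` iterators concurrently, one per thread,
  // and returns the sum each of them produced.
  def runConcurrently(partitions: Int, recordsPerPartition: Int, shared: AtomicLong): Seq[Long] = {
    val pool = Executors.newFixedThreadPool(partitions)
    val startGate = new CountDownLatch(1)
    try {
      val futures = (0 until partitions).map { _ =>
        pool.submit(new Callable[Long] {
          def call(): Long = {
            startGate.await() // release all workers together to maximise contention
            new CountingIterator(Iterator.range(0, recordsPerPartition), shared)
              .map(_.toLong)
              .sum
          }
        })
      }
      startGate.countDown()
      futures.map(_.get(60, TimeUnit.SECONDS))
    } finally {
      pool.shutdownNow()
    }
  }

  def main(args: Array[String]): Unit = {
    val shared = new AtomicLong(0)
    val sums = runConcurrently(partitions = 8, recordsPerPartition = 100000, shared = shared)
    // Invariant: every record is counted exactly once, whatever the interleaving.
    assert(shared.get == 8L * 100000, s"lost updates: ${shared.get}")
    println(s"processed ${shared.get} records across ${sums.size} simulated partitions")
  }
}
```

Swapping the AtomicLong for an unsynchronized counter is a quick way to confirm that the harness really does catch lost updates. It still says nothing about how a real cluster manager schedules tasks, which is the argument for the staging-cluster step.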

score 2 · Answer 3 · answered Aug 28 '19 at 15:54

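Yes! Achieving parallelism in local mode is quite possible. Check the amount of memory and CPU available on your local machine and supply a value for driver-memory when submitting your Spark job. Note that driver-cores only takes effect in cluster deploy mode; in local mode the number of task threads comes from the master URL (local[N] or local[*]).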

Increasing executor-memory and executor-cores will not make a difference in this mode.

Once the application is running, open the Spark UI for the job. You can then go to the Executors tab to check the amount of resources your Spark job is actually using.

You can monitor the tasks that get generated, and the number of tasks that your job runs concurrently, using the Jobs and Stages tabs.

To process data that is much larger than the available resources, break it into smaller partitions using repartition. This should allow your job to complete successfully.

Increase the default number of shuffle partitions if your job has aggregations or joins. Also, ensure there is sufficient space on the local file system, since Spark creates intermediate shuffle files and writes them to disk.
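As a rough sketch of the shuffle-partition advice: spark.sql.shuffle.partitions is the setting that controls how many partitions joins and aggregations produce (its default is 200). The object name, the helper buildSession, the value 400 and the synthetic aggregation below are assumptions for illustration only.

```scala
import org.apache.spark.sql.SparkSession

object ShufflePartitionsSketch {
  // Builds a local session with an explicit number of shuffle partitions.
  def buildSession(shufflePartitions: Int): SparkSession =
    SparkSession.builder
      .appName("shuffle-partitions-sketch")
      .master("local[*]")
      // Partitions used for joins and aggregations; Spark's default is 200.
      .config("spark.sql.shuffle.partitions", shufflePartitions.toString)
      .getOrCreate()

  def main(args: Array[String]): Unit = {
    val spark = buildSession(shufflePartitions = 400) // 400 is an arbitrary example value
    import spark.implicits._

    // Stand-in aggregation: grouping on a derived column forces a shuffle.
    val counts = spark.range(0, 1000000)
      .groupBy(($"id" % 1000).as("bucket"))
      .count()

    println(s"configured shuffle partitions: ${spark.conf.get("spark.sql.shuffle.partitions")}")
    println(s"groups produced: ${counts.count()}")
    spark.stop()
  }
}
```

On recent Spark versions with adaptive query execution enabled, the number of post-shuffle partitions may be coalesced below this setting. The directory used for the intermediate shuffle files is governed by spark.local.dir, so that is where the free disk space is needed.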

Hope this helps!
