
The word count program below creates 10 partitions, but as I understand it, setting master("local[2]") when creating the SparkSession means it will run locally with 2 cores, i.e. 2 partitions.

Can someone help me understand why my Spark code creates 10 partitions instead of 2?

Code:

    // SPACE is the whitespace pattern used to split lines into words
    private static final Pattern SPACE = Pattern.compile(" ");

    SparkSession spark = SparkSession.builder().appName("JavaWordCount").master("local[2]").getOrCreate();

    JavaRDD<String> lines = spark.read().textFile(args[0]).javaRDD();

    JavaRDD<String> words = lines.flatMap(s -> Arrays.asList(SPACE.split(s)).iterator());

    JavaPairRDD<String, Double> pairRDD = words.mapToPair(s -> new Tuple2<>(s, 1.0));
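
For reference, the partition counts can also be printed directly in code (a quick check using JavaRDD.getNumPartitions(); since flatMap and mapToPair preserve partitioning, all three RDDs should report the same 10 here):

    System.out.println("lines partitions:   " + lines.getNumPartitions());
    System.out.println("words partitions:   " + words.getNumPartitions());
    System.out.println("pairRDD partitions: " + pairRDD.getNumPartitions());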

Screenshot of Spark Web UI:

[Spark Web UI screenshot showing the stage with 10 tasks]

    There could just as easily be 5 tasks being done by each core... Those aren't the only values determining the stages https://qubole.zendesk.com/hc/en-us/articles/217111026-Reference-Relationship-between-Partitions-Tasks-Cores – OneCricketeer Dec 23 '18 at 13:56

1 Answer


> it will run locally with 2 cores i.e. 2 partitions

It doesn't mean that at all. It means Spark can use at most 2 threads for data processing. But, with rare exceptions (like parallelize, whose default partition count follows spark.default.parallelism), this has nothing to do with the number of partitions.
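
For example (a minimal sketch; the sample data and the input.txt path are illustrative, and spark is the session from your code), an RDD built with parallelize does follow the thread count, because spark.default.parallelism defaults to the number of local cores, while a file-based source is split by size:

    import java.util.Arrays;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

    // parallelize defaults to spark.default.parallelism, which local[2] sets to 2
    JavaRDD<Integer> nums = jsc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6));
    System.out.println(nums.getNumPartitions()); // 2

    // a file-based source is split by byte size, not by thread count
    JavaRDD<String> fileLines = spark.read().textFile("input.txt").javaRDD();
    System.out.println(fileLines.getNumPartitions()); // depends on the file, not on local[2]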

In such a simple pipeline, the number of partitions depends on the value of the spark.sql.files.maxPartitionBytes parameter (134217728 bytes, i.e. 128 MB, by default). Roughly, Spark splits the input file into chunks of at most that many bytes (also factoring in spark.sql.files.openCostInBytes), so the partition count follows the input size rather than the core count.
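
A sketch of how that could be tuned (the 256 MB value is only an illustration; a larger value yields fewer, larger partitions):

    SparkSession spark = SparkSession.builder()
            .appName("JavaWordCount")
            .master("local[2]")
            // default is 134217728 (128 MB); raise it for fewer, larger partitions
            .config("spark.sql.files.maxPartitionBytes", String.valueOf(256L * 1024 * 1024))
            .getOrCreate();

And if you specifically need 2 partitions downstream, you can always call lines.repartition(2) (or coalesce(n) to shrink without a full shuffle), regardless of how the file was initially split.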