
The word count program below creates 10 partitions, but as I understand it, setting master("local[2]") when creating the SparkSession means it will run locally with 2 cores, i.e. 2 partitions.

Can someone help me understand why my Spark code creates 10 partitions instead of 2?

Code:

    // SPACE is the whitespace pattern used to split lines into words
    private static final Pattern SPACE = Pattern.compile(" ");

    SparkSession spark = SparkSession.builder().appName("JavaWordCount").master("local[2]").getOrCreate();

    JavaRDD<String> lines = spark.read().textFile(args[0]).javaRDD();

    JavaRDD<String> words = lines.flatMap(s -> Arrays.asList(SPACE.split(s)).iterator());

    JavaPairRDD<String, Double> pairRDD = words.mapToPair(s -> new Tuple2<>(s, 1.0));
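
For reference, the partition counts can also be printed directly in code (a quick check using JavaRDD.getNumPartitions(); since flatMap and mapToPair preserve partitioning, all three RDDs should report the same 10 here):

    System.out.println("lines partitions:   " + lines.getNumPartitions());
    System.out.println("words partitions:   " + words.getNumPartitions());
    System.out.println("pairRDD partitions: " + pairRDD.getNumPartitions());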

Screenshot of Spark Web UI:

[Spark Web UI screenshot showing the stage with 10 tasks]

    There could just as easily be 5 tasks being done by each core... Those aren't the only values determining the stages https://qubole.zendesk.com/hc/en-us/articles/217111026-Reference-Relationship-between-Partitions-Tasks-Cores – OneCricketeer Dec 23 '18 at 13:56

1 Answer


> it will run locally with 2 cores i.e. 2 partitions

It doesn't mean that at all. It means Spark can use at most 2 threads for data processing. But, with rare exceptions (like parallelize, whose default partition count follows spark.default.parallelism), this has nothing to do with the number of partitions.
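
For example (a minimal sketch; the sample data and the input.txt path are illustrative, and spark is the session from your code), an RDD built with parallelize does follow the thread count, because spark.default.parallelism defaults to the number of local cores, while a file-based source is split by size:

    import java.util.Arrays;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

    // parallelize defaults to spark.default.parallelism, which local[2] sets to 2
    JavaRDD<Integer> nums = jsc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6));
    System.out.println(nums.getNumPartitions()); // 2

    // a file-based source is split by byte size, not by thread count
    JavaRDD<String> fileLines = spark.read().textFile("input.txt").javaRDD();
    System.out.println(fileLines.getNumPartitions()); // depends on the file, not on local[2]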

In such a simple pipeline, the number of partitions depends on the value of the spark.sql.files.maxPartitionBytes parameter (134217728 bytes, i.e. 128 MB, by default). Roughly, Spark splits the input file into chunks of at most that many bytes (also factoring in spark.sql.files.openCostInBytes), so the partition count follows the input size rather than the core count.
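
A sketch of how that could be tuned (the 256 MB value is only an illustration; a larger value yields fewer, larger partitions):

    SparkSession spark = SparkSession.builder()
            .appName("JavaWordCount")
            .master("local[2]")
            // default is 134217728 (128 MB); raise it for fewer, larger partitions
            .config("spark.sql.files.maxPartitionBytes", String.valueOf(256L * 1024 * 1024))
            .getOrCreate();

And if you specifically need 2 partitions downstream, you can always call lines.repartition(2) (or coalesce(n) to shrink without a full shuffle), regardless of how the file was initially split.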