
I have a Spark application which loads data from CSV files, calls the Drools rule engine, applies a flatMap, and saves the results to output files (.csv).
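For context, the job has roughly this shape (a simplified sketch, not the actual code: the paths, the KieSession name, and the way facts and results are mapped are placeholders):

```scala
import org.apache.spark.sql.SparkSession
import org.kie.api.KieServices

import scala.collection.JavaConverters._

object CsvRulesJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("csv-drools-job").getOrCreate()
    import spark.implicits._

    val input = spark.read.option("header", "true").csv("/data/input")

    val results = input.rdd.mapPartitions { rows =>
      // Build the Drools container on the executor (inside the partition),
      // so nothing Drools-related has to be serialized from the driver.
      val kContainer = KieServices.Factory.get().getKieClasspathContainer
      rows.flatMap { row =>
        val session = kContainer.newKieSession("rulesSession") // placeholder session name
        try {
          // Insert the row as a fact, fire the rules, and collect whatever
          // objects the rules leave in the session as the output records.
          session.insert(row.getValuesMap[String](row.schema.fieldNames))
          session.fireAllRules()
          session.getObjects.asScala.map(_.toString).toList
        } finally {
          session.dispose()
        }
      }
    }

    results.toDF("result").write.option("header", "true").csv("/data/output")
    spark.stop()
  }
}
```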

Below are the two test cases:

1) When I run this application with 2 worker nodes, each with the same configuration (8 cores), the application takes 5.2 minutes to complete:

  • No. of executors: 133
  • Total no. of cores: 16
  • Used memory: 2 GB (1 GB per executor)
  • Available memory: 30 GB

2) When I run this application with 3 worker nodes, each with the same configuration (8 cores), the application takes 7.6 minutes to complete:

  • No. of executors: 266
  • Total no. of cores: 24
  • Used memory: 3 GB (1 GB per executor)
  • Available memory: 45 GB

Expected result

It should take less time after adding one more worker node with the same configuration.

Actual result

It takes more time after adding one more worker node with the same configuration.

I am running the application using the spark-submit command in standalone mode.
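The submit command is roughly as follows (a sketch: the master URL, class name, and jar path are placeholders; executor memory is 1 GB per executor, as listed above):

```
spark-submit \
  --master spark://<master-host>:7077 \
  --class com.example.CsvRulesJob \
  --executor-memory 1G \
  /path/to/csv-rules-job.jar
```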

I want to understand why adding a worker node does not improve performance. Is that not the correct expectation?

EDIT

After looking at other similar questions on Stack Overflow, I tried running the application with spark.dynamicAllocation.enabled=true; however, it degrades performance even further.
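For that run the flag was passed via spark-submit, roughly like this (again a sketch; dynamic allocation also requires the external shuffle service, so spark.shuffle.service.enabled=true is assumed to be set as well):

```
spark-submit \
  --master spark://<master-host>:7077 \
  --class com.example.CsvRulesJob \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  /path/to/csv-rules-job.jar
```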

  • [Spark: Inconsistent performance number in scaling number of cores](https://stackoverflow.com/q/41090127/8371915) – Alper t. Turker Jul 03 '18 at 10:37
  • @user8371915 Thanks for pointing that out. However, the problem in that question involves a single machine with an increased number of cores (which is why the solution there mentions multithreading on a single machine). In my case there are multiple machines (2 worker nodes in the first case and 3 in the second), so does that solution still apply? – Raj Jul 03 '18 at 12:06
  • Did you find a solution? I am facing similar problems. – G.M Jun 02 '23 at 13:15
