
From what I have read, when we run a Spark job we should not run executors with more than 5 cores each; if we increase the cores beyond that limit, the job will suffer from bad I/O throughput.

My doubt is: if we instead increase the number of executors and reduce the cores per executor, those executors can still end up on the same physical machine, reading from and writing to the same disks. Why will this not cause the same I/O throughput issue?

You can consider Apache Spark: The number of cores vs. the number of executors as a reference use case.
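For concreteness, here is a minimal sketch (not part of the original question) of the two layouts being compared, using the standard `spark.executor.*` properties; the executor counts and memory sizes are made-up values for a hypothetical 16-core, 64 GB worker:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object ExecutorLayoutSketch {
  def main(args: Array[String]): Unit = {
    // Layout A: executors capped at 5 cores each (the commonly cited limit).
    // Hypothetical sizing for a 16-core, 64 GB worker, leaving headroom for
    // the OS and node daemons.
    val fiveCoreConf = new SparkConf()
      .set("spark.executor.cores", "5")
      .set("spark.executor.instances", "3")
      .set("spark.executor.memory", "19g")

    // Layout B: more, thinner executors on the same hypothetical worker.
    // Total task slots are identical (3 x 5 = 15 vs 15 x 1 = 15), and the
    // executors still share the node's disks and network interface.
    val oneCoreConf = new SparkConf()
      .set("spark.executor.cores", "1")
      .set("spark.executor.instances", "15")
      .set("spark.executor.memory", "4g")

    val spark = SparkSession.builder()
      .appName("executor-layout-sketch")
      .config(fiveCoreConf) // or oneCoreConf, to compare the two layouts
      .getOrCreate()

    // ... job logic ...
    spark.stop()
  }
}
```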

OneCricketeer
shashikumar

1 Answer

The cores within an executor are like threads. Just as more work gets done when we increase parallelism, we should keep in mind that there is a limit to it, because the results from those parallel tasks still have to be gathered.

Ramzy
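As a rough illustration of that point (this sketch is mine, not part of the answer): the number of tasks that can run at once is executors × cores per executor, and every one of those parallel tasks produces a partial result that has to be gathered back:

```scala
import org.apache.spark.sql.SparkSession

object TaskSlotSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("task-slot-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Total task slots = executors x cores per executor; with either layout
    // above (3 x 5 or 15 x 1) at most 15 tasks run concurrently on the node.
    val slots = sc.getConf.getInt("spark.executor.instances", 1) *
                sc.getConf.getInt("spark.executor.cores", 1)

    // Each task computes a partial sum that must be collected back to the
    // driver -- the coordination cost the answer refers to.
    val total = sc.parallelize(1 to 1000000, numSlices = slots)
                  .map(_.toLong)
                  .sum() // gathering step: one result comes back per task

    println(s"slots=$slots total=$total")
    spark.stop()
  }
}
```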
  • I didn't get that; could you please elaborate? My doubt is how reducing the number of threads within a process while increasing the number of processes on the same machine will reduce disk I/O :( – shashikumar Jan 04 '20 at 09:46
  • When you choose more cores, you are increasing the parallelism within the executor. And that is limited because you have to gather results from more threads/cores, so more calls. – Ramzy Jan 06 '20 at 06:43
  • If you look at the answer for https://stackoverflow.com/questions/24622108/apache-spark-the-number-of-cores-vs-the-number-of-executors, it mentions bad HDFS I/O throughput as the problem. But if we run more executors with fewer cores, then multiple threads from different processes trying to access HDFS blocks on the same machine should also lead to bad HDFS I/O throughput, right? How does this get solved? I am not understanding this part. – shashikumar Jan 07 '20 at 13:10