
There are several highly voted threads that I am having difficulty interpreting, perhaps because the jargon of 2016 differs from that of today (or perhaps I am just not getting it):

Apache Spark: The number of cores vs. the number of executors

How to tune spark executor number, cores and executor memory?

Azure/Databricks offers some best practices on cluster sizing: https://learn.microsoft.com/en-us/azure/databricks/clusters/cluster-config-best-practices

So for my workload, let's say I am interested in (using Databricks' current jargon):

  • 1 driver with 64 GB of memory and 8 cores
  • 1 worker with 256 GB of memory and 64 cores

Drawing on the Microsoft link above, fewer workers should in turn lead to less shuffle, which is among the most costly Spark operations.

So, I have 1 driver and 1 worker. How, then, do I translate these terms into the "nodes" and "executors" discussed here on SO?

Ultimately, I would like to set my Spark config "correctly" so that cores and memory are as optimized as possible.
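For concreteness, here is a minimal sketch of the kind of configuration I mean, in plain (non-Databricks) Spark terms. Splitting the single 64-core / 256 GB worker into 8 executors is purely illustrative, not a recommendation, and I understand that on Databricks these properties are normally set in the cluster's Spark config rather than in application code:

```python
from pyspark.sql import SparkSession

# Illustrative only: carve the single 64-core / 256 GB worker into
# 8 executors of 8 cores / 28 GB each (leaving headroom for overhead).
# These properties take effect at application launch; they have no
# effect on an already-running session.
spark = (
    SparkSession.builder
    .appName("cluster-sizing-example")
    .config("spark.executor.instances", "8")   # hypothetical split
    .config("spark.executor.cores", "8")
    .config("spark.executor.memory", "28g")
    .getOrCreate()
)
```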

John Stud
  • If you are only going to use one worker (executor), why are you using Spark? If shuffle is a problem, it should be solved using a more effective partitioning strategy, not by getting rid of parallelized execution. You should configure more executors and split the available memory and CPUs between them. Also, 64 GB and 8 cores is too much for the driver. – Z4-tier Dec 17 '22 at 01:41
  • As far as I can tell now, despite having 1 "worker", you can assign many executors to that single worker / VM. Any idea why the owners of Databricks / creators of Spark recommend 1 worker in several situations? https://learn.microsoft.com/en-us/azure/databricks/clusters/cluster-config-best-practices – John Stud Dec 17 '22 at 02:09
  • The jargon hasn't changed much. Check out the [official documentation](https://spark.apache.org/docs/latest/cluster-overview.html) first. – Hristo Iliev Dec 17 '22 at 11:37

0 Answers