
I'm trying to figure out whether a single task is ever run using all available cores on the executor. I.e., if a stage contains only one task, does that mean the task is single-threaded, single-core processing on the executor, or could the task use all available cores in a multithreaded fashion "under the covers"?

I'm running ETL jobs in Azure Databricks on one worker (hence one executor) and at one point in the pipeline a single job creates a single stage that runs a single task to process the entire dataset. The task takes a few minutes to complete.

I want to understand whether a single task can use all available executor cores to run functions in parallel or not. In this case I deserialize JSON messages with the from_json function and save them as Parquet files. I'm worried this is a single-threaded process running inside the single task.

spark
    .read
    .table("input")
    .withColumn("Payload", from_json($"Payload", schema))
    .write
    .mode(SaveMode.Append)
    .saveAsTable("output")
Molotch
  • possible dup of [Number of CPUs per Task in Spark](https://stackoverflow.com/q/36671832/647053); see the accepted answer. – Ram Ghadiyaram Jul 19 '19 at 16:58
  • I saw that answer. As I understand it, spark.task.cpus and spark.cores.max just limit the number of tasks created on the driver side; they do not decide the behaviour of a task on the executor. – Molotch Jul 22 '19 at 09:06

1 Answer


If you are looking at the Spark UI and see only one task, it is definitely single-core and single-threaded.

For instance, if you do a join and then a transformation, you will see something like 200 tasks by default (the default value of spark.sql.shuffle.partitions), which means 200 "threads" are computing in parallel.
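
The 200 can be checked directly. A rough sketch, assuming only a plain SparkSession named spark (the toy DataFrames are made up for the example):

import spark.implicits._

// The post-shuffle task count is driven by this setting (default "200").
println(spark.conf.get("spark.sql.shuffle.partitions"))

// Toy join: the shuffled result ends up with that many partitions,
// and each partition is executed as one task.
val left  = Seq((1, "a"), (2, "b")).toDF("id", "l")
val right = Seq((1, "x"), (2, "y")).toDF("id", "r")
println(left.join(right, "id").rdd.getNumPartitions)
// usually 200 on Spark 2.x; adaptive execution may coalesce it on newer versions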

If you want to check the number of executors, you can go to the Stages tab, click on any stage, and you will see how many executors were used.
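
In your case, one way to get more than one task out of that stage is to repartition before the expensive work, so every core gets a slice of the data. This is only a sketch, reusing the names from your question; the 8 is just an assumption matching your executor's core count:

spark
    .read
    .table("input")
    .repartition(8) // assumption: one partition per executor core
    .withColumn("Payload", from_json($"Payload", schema))
    .write
    .mode(SaveMode.Append)
    .saveAsTable("output")

Each of the 8 partitions then runs as its own task, so the from_json work is spread across the cores.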

BlueSheepToken
  • Thanks, I only have one executor with 8 cores. If there's only a single task in a stage I can be sure a single core is doing all the work. It's weird, the task input size is 5.7 GB; I can't understand how all that data ended up in a single partition. I'll have to try to repartition the dataset before I run this part of the pipeline. – Molotch Jul 22 '19 at 10:29
  • @Molotch Exactly! Partitioning is usually the solution to explore – BlueSheepToken Jul 22 '19 at 12:07
  • They may not all be computing in parallel unless you have 200 executor cores at your disposal; that was my understanding. – thebluephantom Jul 22 '19 at 12:38
  • @BlueSheepToken I think I found the problem too: coalesce(1) might be pushed up the plan for some reason I haven't understood yet. https://stackoverflow.com/questions/44494656/how-to-prevent-spark-optimization – Molotch Jul 22 '19 at 13:57
  • @Molotch Thanks for letting me know! I do not know about this particular optimization, but yeah, Spark does some predicate pushdown optimizations. If you have the plans I might dig a little bit more :) – BlueSheepToken Jul 22 '19 at 14:04
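
For anyone hitting the same coalesce(1) surprise: the physical plan shows where the single partition comes from. A minimal sketch, reusing the names from the question:

// explain(true) prints the parsed, analysed, optimised and physical plans;
// a Coalesce(1) or single-partition exchange in the physical plan explains
// why the stage runs as one task.
spark
    .read
    .table("input")
    .withColumn("Payload", from_json($"Payload", schema))
    .explain(true)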