
I am running a Spark SQL job on AWS EMR which reads ~100k small JSON files from S3, does a few transformations, and writes the results back to S3. I have set shuffle partitions and default parallelism to 20, and executor memory to 4GB. However, one of the stages shown in the UI, javaToPython at NativeMethodAccessorImpl.java, which I understand is the stage that writes to S3, has nearly 2.7k tasks even though its input data size is < 1MB. I see the same behavior for the stage with a collect action. I don't understand why. What am I missing here? I have also tested the app by reducing the number of partitions (with coalesce), but nothing seems to change. I am running PySpark 2.4.7 on EMR-5.33.1.
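For reference, a rough sketch of my setup (bucket paths and the transformation are placeholders; on EMR some of these settings are normally passed via spark-submit or the cluster configuration instead):

    from pyspark.sql import SparkSession

    # Sketch of the job described above; paths and the transformation are placeholders
    spark = (
        SparkSession.builder
        .appName("json-transform")
        .config("spark.sql.shuffle.partitions", "20")
        .config("spark.default.parallelism", "20")
        .config("spark.executor.memory", "4g")
        .getOrCreate()
    )

    df = spark.read.json("s3://my-bucket/input/")    # ~100k small JSON files
    result = df.select("*")                          # stand-in for the actual transformations
    result.coalesce(20).write.mode("overwrite").json("s3://my-bucket/output/")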

Monika

1 Answer


To understand what you're observing, you'll need to look at your cluster details and configuration.

Some assumptions about your application:

  • You didn't implement threading in your application

The number of tasks is defined by the parallelism (total concurrent tasks), which is determined by the:

  • Number of executors
  • Number of cores for each executor
  • Number of partitions

Please have a read of how stages are split into tasks.
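As a quick check, you can inspect the effective parallelism and how many partitions Spark actually created; each partition is processed by one task per stage (this sketch assumes an existing session named spark and a DataFrame named df):

    # Illustrative check of parallelism and partition count
    print(spark.sparkContext.defaultParallelism)   # total default parallelism
    print(df.rdd.getNumPartitions())               # partitions backing the DataFrame
    # One task is launched per partition in a stage, so ~2.7k partitions -> ~2.7k tasks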

With that said, another thought: you mentioned

~100k small JSON files

If you haven't already, make sure to review and understand the Small Files Problem.

The section called "1. Hazards of small files" states:

  1. It is easy to cause too many tasks. If the total size of the serialized results exceeds the parameter spark.driver.maxResultSize (default 1g), the following exception will be thrown, affecting the processing of tasks:

    Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 478 tasks (2026.0 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)

  Of course, you can increase the default configuration of spark.driver.maxResultSize to work around the problem, but if you can't solve the small file problem at the source, you may encounter similar problems in the future.
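If you do hit that exception, the workaround mentioned in the quote would look roughly like the following (the 2g value is only an example, and raising the limit treats the symptom rather than the small-files cause):

    from pyspark.sql import SparkSession

    # Raise the driver result-size limit (default is 1g) when building the session
    spark = (
        SparkSession.builder
        .config("spark.driver.maxResultSize", "2g")
        .getOrCreate()
    )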

In addition, when Spark processes tasks, one partition is assigned one task, and multiple partitions are processed in parallel. Although parallel processing can improve efficiency, more tasks are not always better: if the amount of data is small, too many tasks will actually hurt efficiency.
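One common mitigation, sketched below with placeholder paths, is to compact the many small JSON files into a handful of larger files in a separate pass, then run the real job against the compacted copy:

    # Sketch: compact ~100k small JSON files into ~20 larger Parquet files
    # (assumes an existing SparkSession named spark; paths are placeholders)
    df = spark.read.json("s3://my-bucket/input/")
    df.repartition(20).write.mode("overwrite").parquet("s3://my-bucket/compacted/")

    # The real job then reads far fewer, larger files
    compacted = spark.read.parquet("s3://my-bucket/compacted/")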

Understanding how tasks are created (both configuration-wise and application-wise), together with what your Spark application is trying to ingest, will help you understand what your application is doing.

Bobshark