I am doing a very simple job at a very large scale: I have 480 GB of JSON files in an S3 bucket.
val events = spark.read.textFile("s3a://input/")   // Dataset[String], one element per line
val filteredEvents = events.filter(_.contains("..."))
filteredEvents.write.text("s3a://output/")
After ~5 minutes of steady progress, one last task takes forever. I can see a lot of partial files in the S3 bucket, but the job is not finished: there is still a _temporary folder and no _SUCCESS file. I waited another ~20 minutes with no change; that one remaining task shows a huge scheduler delay.
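From what I have read about how Spark commits output, tasks write under _temporary and the results are renamed into place at job commit, which is slow on S3 because a rename is really a copy plus delete. One tweak I am considering (a sketch, not something I have verified on this job) is switching the Hadoop file output committer to its v2 algorithm, which promotes task output as each task commits instead of in one big rename pass at the end:

// Sketch, untested here: use the v2 commit algorithm so task output is
// moved into place when each task finishes rather than at job commit.
spark.sparkContext.hadoopConfiguration
  .set("mapreduce.fileoutputcommitter.algorithm.version", "2")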
I also wondered whether this might be the workers sending data back to the driver for the final write; can't each worker write its part directly to S3?
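Related to that: since the output is already a lot of small partial files, one variant I considered (untested, and the 128 below is an arbitrary guess rather than a tuned value) is cutting the partition count before the write so each worker writes fewer, larger objects:

// Sketch: coalesce before writing so the job emits fewer, larger files.
// 128 partitions is a placeholder, not a tuned value for this data size.
filteredEvents
  .coalesce(128)
  .write
  .text("s3a://output/")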
My cluster has 16 m3.2xlarge nodes.
Am I trying a job too big for a small cluster?