
I am doing a very simple job at a very large scale.

I have 480 GB of JSON files in an S3 bucket.

val events = spark.read.textFile("s3a://input/")        // Dataset[String], one element per line
val filteredEvents = events.filter(_.contains("..."))   // keep only the lines containing the marker
filteredEvents.write.text("s3a://output/")              // write the surviving lines back to S3

After about 5 minutes of steady progress, one last task takes forever. I can see a lot of partial files in the S3 bucket, but the job is not finished: there is still a temporary folder and no success marker. I waited another ~20 minutes with no change; that single remaining task shows a huge scheduler delay.

I suppose this might be the workers sending data back to the scheduler; can't each worker write directly to S3?
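
If it helps, my current guess (which I haven't verified) is that the executors do write their part files straight to S3 under a temporary prefix, and the slow final task is the commit step that renames everything into place. Assuming the default Hadoop `FileOutputCommitter` is what's running, this is the kind of setting I was thinking of experimenting with:

// Untested idea: the v2 commit algorithm moves each task's output to its final
// location when the task commits, instead of in one big rename pass at the end.
// (I believe this needs Hadoop 2.7+ and must be set before the write is triggered.)
spark.sparkContext.hadoopConfiguration
  .set("mapreduce.fileoutputcommitter.algorithm.version", "2")

I don't know whether that is actually the bottleneck here, so corrections welcome.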

My cluster has 16 m3.2xlarge nodes.

Am I trying a job too big for a small cluster?

  • Can you post the stacktrace from the worker that's stuck? – Yuval Itzchakov Aug 28 '16 at 06:45
  • I will try again tomorrow. I already shut down the cluster... do you have any ideas in mind? –  Aug 28 '16 at 07:00
  • I ran into a timeout problem writing to S3 with `HttpComponent.HttpClient`. I wonder if it's the same. You can read about it in [this question](http://stackoverflow.com/questions/38606653/spark-stateful-streaming-job-hangs-at-checkpointing-to-s3-after-long-uptime). – Yuval Itzchakov Aug 28 '16 at 08:21

0 Answers