I am doing a very simple job at a very large scale: I have 480 GB of JSON files in an S3 bucket.
val events = spark.read.textFile("s3a://input/")   // Dataset[String], one element per line
val filteredEvents = events.filter(_.contains("..."))
filteredEvents.write.text("s3a://output/")
After ~5 minutes of steady progress, one last task takes forever. I can see a lot of partial files in the S3 bucket, but the job is not finished: there is still a _temporary folder and no _SUCCESS file. I waited another ~20 minutes with no change; that one remaining task shows a huge scheduler delay.
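From what I have read about how Spark commits output, tasks write under _temporary and the results are renamed into place at job commit, which is slow on S3 because a rename is really a copy plus delete. One tweak I am considering (a sketch, not something I have verified on this job) is switching the Hadoop file output committer to its v2 algorithm, which promotes task output as each task commits instead of in one big rename pass at the end:

// Sketch, untested here: use the v2 commit algorithm so task output is
// moved into place when each task finishes rather than at job commit.
spark.sparkContext.hadoopConfiguration
  .set("mapreduce.fileoutputcommitter.algorithm.version", "2")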
I also wondered whether this might be the workers sending data back to the driver for the final write; can't each worker write its part directly to S3?
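Related to that: since the output is already a lot of small partial files, one variant I considered (untested, and the 128 below is an arbitrary guess rather than a tuned value) is cutting the partition count before the write so each worker writes fewer, larger objects:

// Sketch: coalesce before writing so the job emits fewer, larger files.
// 128 partitions is a placeholder, not a tuned value for this data size.
filteredEvents
  .coalesce(128)
  .write
  .text("s3a://output/")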
My cluster has 16 m3.2xlarge nodes.
Am I trying a job too big for a small cluster?