
I am trying to run an Apache Spark direct streaming application on AWS EMR. The application receives data from and sends data to AWS Kinesis and needs to be running the whole time.

Of course, if a core node is killed, it stops. But it should self-heal once the core node is replaced.

Now I noticed: when I kill one of the core nodes (simulating a problem), it is replaced by AWS EMR. But the application stops working (no output is sent to Kinesis anymore), and it also does not resume working unless I restart it.

What I get in the logs is:

ERROR YarnClusterScheduler: Lost executor 1 on ip-10-1-10-100.eu-central-1.compute.internal: Slave lost

Which is expected. But then I get:

20/11/02 13:15:32 WARN TaskSetManager: Lost task 193.1 in stage 373.0 (TID 37583, ip-10-1-10-225.eu-central-1.compute.internal, executor 2): FetchFailed(null, shuffleId=186, mapIndex=-1, mapId=-1, reduceId=193, message=
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 186

These are just warnings, yet the application no longer produces any output.

So I stop the application and start it again. Now it produces output again.

My question: Is AWS EMR suited for a self-healing application like the one I need, or am I using the wrong tool? If it is suited, how do I get my Spark application to continue after a core node is replaced?

Nathan

1 Answer

It's recommended to use On-Demand instances for the CORE nodes, and at the same time use TASK instances to leverage Spot pricing.

Have a look
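As a rough illustration of that layout (not the answerer's exact setup), here is a minimal boto3 sketch that launches an EMR cluster with On-Demand CORE nodes and Spot TASK nodes. The cluster name, instance types, counts, and release label are assumptions; adjust them to your own environment.

# Minimal sketch: EMR cluster with On-Demand CORE and Spot TASK instance groups.
# All names, types, and counts below are illustrative assumptions.
import boto3

emr = boto3.client("emr", region_name="eu-central-1")

response = emr.run_job_flow(
    Name="spark-streaming-cluster",      # hypothetical cluster name
    ReleaseLabel="emr-6.1.0",            # assumption: any EMR release with Spark 3.x
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {
                "Name": "Master",
                "InstanceRole": "MASTER",
                "Market": "ON_DEMAND",
                "InstanceType": "m5.xlarge",
                "InstanceCount": 1,
            },
            {
                # CORE nodes hold HDFS and shuffle data, so keep them On-Demand
                "Name": "Core",
                "InstanceRole": "CORE",
                "Market": "ON_DEMAND",
                "InstanceType": "m5.xlarge",
                "InstanceCount": 2,
            },
            {
                # TASK nodes carry no HDFS data, so Spot interruptions are cheaper to absorb
                "Name": "Task",
                "InstanceRole": "TASK",
                "Market": "SPOT",
                "InstanceType": "m5.xlarge",
                "InstanceCount": 2,
            },
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])

With this split, losing a Spot TASK node costs only recomputation of its tasks, while the CORE nodes that store HDFS blocks and shuffle output stay on more reliable On-Demand capacity.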

Snigdhajyoti