
TL;DR

Is there a way to timeout a pyspark job? I want a spark job running in cluster mode to be killed automatically if it runs longer than a pre-specified time.

Longer Version:

The cryptic timeouts listed in the documentation are at most 120s, except one which is infinity, but that one is only used when spark.dynamicAllocation.enabled is set to true, and by default (I haven't touched any config parameters on this cluster) it is false.

I want to know because I have code that, for a particular pathological input, will run extremely slowly. For expected input the job terminates in under an hour. Detecting the pathological input is as hard as solving the problem itself, so I don't have the option of doing clever preprocessing. The details of the code are boring and irrelevant, so I'll spare you having to read them =)

I'm using pyspark, so I was going to decorate the function causing the hang-up like this, but it seems that solution doesn't work in cluster mode. I call my spark code via spark-submit from a bash script, but as far as I know bash "goes to sleep" while the spark job is running and only gets control back once the spark job terminates, so I don't think this is an option.

Actually, the bash thing might be a solution if I did something clever, but I'd have to get the driver id for the job like this, and by now I'm thinking "this is too much thought and typing for something as simple as a timeout, which ought to be built in."
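For what it's worth, if the job runs in client mode (so that spark-submit itself blocks until the job finishes), the wrapper-script idea can be sketched with coreutils `timeout` instead of anything clever. This is a hypothetical sketch: `my_job.py` and the one-hour limit are placeholders, and in cluster mode you would still need the driver-id dance described above.

```shell
#!/usr/bin/env bash
# Sketch only: assumes client mode, where spark-submit blocks until the job ends.
# `timeout` sends SIGTERM after the limit and exits with status 124 on timeout.
timeout --signal=SIGTERM 3600 spark-submit my_job.py
status=$?

if [ "$status" -eq 124 ]; then
    echo "Spark job exceeded the 1-hour limit and was terminated"
fi
```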

Community
  • The more details you share the better chance we'll be able to help you. –  Oct 24 '16 at 22:17
  • I just want to know if there is a configuration parameter somewhere that automatically kills a spark job running in cluster mode if it runs for longer than some specified time. –  Oct 25 '16 at 00:11
  • I added some more relevant information about things I've tried! –  Oct 25 '16 at 00:19

1 Answer


You can set a classic Python alarm. Then, in the handler function, you can raise an exception or call sys.exit() to finish the driver code. When the driver finishes, YARN kills the whole application.

You can find example usage in documentation: https://docs.python.org/3/library/signal.html#example

Mariusz
  • I can give it another try. I tried to follow the example here http://stackoverflow.com/questions/2281850/timeout-function-if-it-takes-too-long-to-finish but this wasn't working. –  Nov 03 '16 at 15:25