TL;DR
Is there a way to timeout a pyspark job? I want a spark job running in cluster mode to be killed automatically if it runs longer than a pre-specified time.
Longer Version:
The cryptic timeouts listed in the documentation are at most 120s, except for one that is infinity; that one is only used when spark.dynamicAllocation.enabled is set to true, which it isn't by default (I haven't touched any config parameters on this cluster).
I'm asking because I have code that, for a particular pathological input, runs extremely slowly. For expected input the job terminates in under an hour. Detecting the pathological input is as hard as solving the problem itself, so clever preprocessing isn't an option. The details of the code are boring and irrelevant, so I'll spare you having to read them =)
I'm using pyspark, so I was going to decorate the function causing the hang-up with a timeout like this, but it seems that solution doesn't work in cluster mode. I call my spark code via spark-submit from a bash script, but as far as I know bash "goes to sleep" while the spark job is running and only gets control back once the job terminates, so I don't think that's an option either.
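For context, the decorator I had in mind is roughly the usual SIGALRM-based one (names and the one-hour figure below are just illustrative):

```python
import signal
from functools import wraps

def timeout(seconds):
    """Raise TimeoutError if the decorated call runs longer than `seconds`."""
    def decorator(func):
        def _handle_timeout(signum, frame):
            raise TimeoutError(f"{func.__name__} timed out after {seconds}s")

        @wraps(func)
        def wrapper(*args, **kwargs):
            # The alarm only affects this Python process (the driver) and
            # must be armed from its main thread.
            signal.signal(signal.SIGALRM, _handle_timeout)
            signal.alarm(seconds)
            try:
                return func(*args, **kwargs)
            finally:
                signal.alarm(0)  # disarm on success or failure
        return wrapper
    return decorator

@timeout(3600)  # give the slow step at most one hour
def run_slow_step(df):
    ...
```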
Actually, the bash approach might work if I did something clever, but I'd have to get the driver id for the job like this, and by now I'm thinking "this is too much thought and typing for something as simple as a timeout, which ought to be built in."
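For completeness, this is roughly the kind of wrapper I was picturing, sketched in Python rather than bash (the command, file name, and one-hour limit are placeholders), and it still leaves the "kill the driver by id" problem open:

```python
import subprocess

# Launch spark-submit from a wrapper so the whole run gets a wall-clock cap.
cmd = ["spark-submit", "--deploy-mode", "cluster", "my_job.py"]
try:
    subprocess.run(cmd, timeout=3600, check=True)  # cap at one hour
except subprocess.TimeoutExpired:
    # Only the local spark-submit process gets killed here; in cluster mode
    # the driver keeps running on the cluster, so I'd still have to look up
    # its id and kill it separately (e.g. yarn application -kill <appId>).
    print("spark-submit exceeded the time limit")
```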