I have a Hadoop job that processes log files and reports some statistics. The job died about halfway through because it ran out of file handles. I have fixed the file-handle issue and am wondering if it is possible to restart a "killed" job.
-
Are you speaking of restarting the whole job or a particular node's task? If some nodes completed, then you should have their output and can run on the complement of tasks that did not complete. In practice, though, I find it better to rerun the whole thing. If there's one problem, there could well be others, and it's rarely worthwhile to me to sift through a lot of detritus to figure out what's salvageable. – Iterator Feb 16 '12 at 18:41
-
If you wanted it to happen automatically, it seems like the job would have to be designed for this. That might be worthwhile in certain cases. If you could fire it back up and it could figure out, oh, I've already completed that piece, then it could skip it. – Don Branson Feb 16 '12 at 19:14
-
I'm thinking about the whole job; it was about a third of the way through, and I was hoping not to lose that work. I see what you're saying about trying to extract the unprocessed data; at that point it'd probably be easier to re-run the job. More than anything I wanted to make sure that I wasn't overlooking a function that would let me restart a killed job. – Miles Feb 17 '12 at 01:39
1 Answer
As it turns out, there is no good way to do this; once a job has been killed, there is no way to re-instantiate it and resume processing from just before the first failure. There are likely some really good reasons for this, but I'm not qualified to speak to them.
In my own case, I was processing a large set of log files and loading them into an index, while also creating a report on their contents. To make the job more tolerant of failures on the indexing side (a side effect unrelated to Hadoop), I altered my job to instead create many smaller jobs, each processing a chunk of the log files. When one of these jobs finishes, it renames the processed log files so that they are not processed again. Each job waits for the previous one to complete before running.
When one job fails, all of the subsequent jobs quickly fail as well. Simply fixing whatever the issue was and re-submitting my job will, roughly, pick up processing where it left off. In the worst-case scenario, a job that was 99% complete at the time of its failure will be erroneously and wastefully re-processed.
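The rename-on-success idea above can be sketched in plain Python, independent of Hadoop. This is a minimal local simulation, not the actual job driver; the file names, the `.done` suffix, and the `process_chunk` stand-in are all hypothetical choices for illustration:

```python
import os

def process_chunk(path):
    """Stand-in for one small job: read a log file and count its lines."""
    with open(path) as f:
        return sum(1 for _ in f)

def run_chunked(log_dir, suffix=".done"):
    """Process each unprocessed log file in order, renaming each file only
    after it succeeds. If a run dies partway through, re-running after the
    fix skips the already-renamed files and picks up where it left off."""
    total = 0
    for name in sorted(os.listdir(log_dir)):
        if name.endswith(suffix):
            continue  # already processed by an earlier run
        path = os.path.join(log_dir, name)
        total += process_chunk(path)
        os.rename(path, path + suffix)  # mark done only after success
    return total
```

Because the rename happens after processing, a crash re-processes at most the one chunk that was in flight, which mirrors the worst case described above.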