I have seen it stated that big data jobs are idempotent when we use MR, Spark, or Tez as the execution engine.
The job first writes its data to a temporary directory, e.g. ".hive-staging..." or "_temporary".
Then the data is moved to its final destination by the FileOutputCommitter, roughly in these steps:
- If the destination directory already exists, it is deleted (trashed).
- The temporary directory is then moved (renamed) to the destination location.
My question is: why do we say the job is idempotent, i.e. that it will either succeed completely or fail cleanly? Can't there be a case where the existing destination data has already been trashed, but the job then fails while moving files from the temporary directory to the destination, leaving us with both a failed job and lost data? (A simplified sketch of the sequence I have in mind is below.)
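To make the failure window I am asking about concrete, here is a minimal sketch of the two-step commit as I understand it, written against the Hadoop FileSystem API. This is not the actual FileOutputCommitter source (the real committer has task/job commit phases and different algorithm versions); the class, method, and the tempDir/destDir names are just placeholders for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;

// Simplified sketch of the commit sequence described above; NOT the real
// FileOutputCommitter, just the two steps as I understand them.
public class NaiveCommitSketch {

    public static void commit(Configuration conf, Path tempDir, Path destDir)
            throws IOException {
        FileSystem fs = destDir.getFileSystem(conf);

        // Step 1: if the destination already exists, delete ("trash") it.
        if (fs.exists(destDir)) {
            fs.delete(destDir, true); // recursive delete of the old output
        }

        // If the job dies right here, the old output is already gone and the
        // new output is still sitting in tempDir. This is the window my
        // question is about: the job has failed AND the previous data is lost.

        // Step 2: move the temporary output into its final place.
        if (!fs.rename(tempDir, destDir)) {
            throw new IOException("rename failed: " + tempDir + " -> " + destDir);
        }
    }
}
```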