I have seen it stated that big data jobs written with MR, Spark, or Tez as the execution engine are idempotent.

The job first writes data to a temporary directory, e.g. ".hivestaging..." or "_temporary".

The FileOutputCommitter then merges the data into its final destination in the following steps:

  • If the destination directory already exists, it is moved to the trash.
  • The directory is moved from the temporary location to the destination location.
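
The two steps above can be sketched roughly as follows. This is only an illustration on a local filesystem, not the actual FileOutputCommitter code; the function `commit_output` and the directory names `table` and `.Trash` are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def commit_output(staging: Path, destination: Path, trash: Path) -> None:
    """Hypothetical two-step commit: trash the old output, then move the new one in."""
    if destination.exists():
        # Step 1: move any existing destination directory to the trash.
        trash.mkdir(parents=True, exist_ok=True)
        shutil.move(str(destination), str(trash / destination.name))
    # Step 2: move the staged output into place.
    # A failure between step 1 and step 2 leaves no directory at `destination`.
    shutil.move(str(staging), str(destination))


# Build a throwaway layout: an old output directory and a freshly staged one.
root = Path(tempfile.mkdtemp())
staging = root / "_temporary"
destination = root / "table"
trash = root / ".Trash"

destination.mkdir()
(destination / "part-00000").write_text("old")
staging.mkdir()
(staging / "part-00000").write_text("new")

commit_output(staging, destination, trash)
committed = (destination / "part-00000").read_text()
trashed = (trash / "table" / "part-00000").read_text()
```

In this sketch the two moves are separate calls, so the window between them is exactly the case asked about below.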

My question is: why do we say that the job is idempotent, that it will either succeed or fail? Can't there be a case where some data has been moved to the trash and the job then fails while moving files from the temporary directory to the destination, leading to both job failure and loss of data?

Jonathan Myers
dinesh028

1 Answer

Big data jobs are sometimes idempotent and sometimes not, just like many other things in programming.

From "What is an idempotent operation?":

In computing, an idempotent operation is one that has no additional effect if it is called more than once with the same input parameters. For example, removing an item from a set can be considered an idempotent operation on the set.
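
As a small illustration of that definition, here is a sketch comparing set removal with list appending. The helper names `remove_item` and `append_item` are made up for this example.

```python
def remove_item(items: set, item: str) -> set:
    """Return a copy of `items` without `item`; removing an absent item is a no-op."""
    result = set(items)
    result.discard(item)
    return result


def append_item(items: list, item: str) -> list:
    """Return a copy of `items` with `item` added at the end."""
    return items + [item]


start = {"a", "b", "c"}
once = remove_item(start, "b")
twice = remove_item(once, "b")  # second call changes nothing

appended_once = append_item(["a"], "b")
appended_twice = append_item(appended_once, "b")  # second call adds a duplicate
```

Applying `remove_item` a second time leaves the set unchanged, so it is idempotent; applying `append_item` a second time grows the list, so it is not.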

These jobs do not modify the input files they are given, so if their only result is output files, the jobs are idempotent. Running them again on the same files will either fail or produce the same results (albeit potentially in a different order).

However, if your job writes to external systems (such as uploading to a database), repeat runs might add additional data. In that case, the job would not be idempotent.
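
A minimal sketch of the database case, using an in-memory SQLite database as a stand-in for the external system. The table names `appended` and `upserted` and the function `run_job` are hypothetical, and the keyed `INSERT OR REPLACE` is shown only as one possible way to make a repeat run harmless, assuming each row has a stable key.

```python
import sqlite3

rows = [(1, "alice"), (2, "bob")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE appended (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE upserted (id INTEGER PRIMARY KEY, name TEXT)")


def run_job(conn: sqlite3.Connection) -> None:
    """Stand-in for a job that uploads its results to a database."""
    # Plain insert: every run adds the rows again.
    conn.executemany("INSERT INTO appended VALUES (?, ?)", rows)
    # Keyed insert-or-replace: a repeat run overwrites the same rows.
    conn.executemany("INSERT OR REPLACE INTO upserted VALUES (?, ?)", rows)
    conn.commit()


run_job(conn)
run_job(conn)  # simulate a retry of the same job

appended_count = conn.execute("SELECT COUNT(*) FROM appended").fetchone()[0]
upserted_count = conn.execute("SELECT COUNT(*) FROM upserted").fetchone()[0]
```

After two runs the `appended` table holds four rows while `upserted` still holds two, which is the difference between a non-idempotent and an idempotent write.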

Jonathan Myers