As of today, the only case where the Job object is useful is when using Job Bookmarks. When you read files from Amazon S3 (the only source supported for bookmarks so far) and call job.commit, the timestamp and the paths read so far are stored internally, so that if for some reason you attempt to read that path again, you will only get back the unread (new) files.
In this code sample, I read and process two different paths separately, and commit after each path is processed. If for some reason the job stops and is re-run, the files that were already committed won't be processed again.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['TempDir', 'JOB_NAME'])

sc = SparkContext()
glue_context = GlueContext(sc)

# Init my job
job = Job(glue_context)
job.init(args['JOB_NAME'], args)

paths = [
    's3://bucket-name/my_partition=apples/',
    's3://bucket-name/my_partition=oranges/']

# Read each path individually, operate on it and commit
for path in paths:
    try:
        dynamic_frame = glue_context.create_dynamic_frame_from_options(
            connection_type='s3',
            connection_options={'paths': [path]},
            format='json',
            transformation_ctx="path={}".format(path))
        do_something(dynamic_frame)

        # Commit file read to Job Bookmark
        job.commit()
    except Exception:
        # Something failed
        pass
Calling the commit method on a Job object only works if you have Job Bookmark enabled, and the stored references are kept from JobRun to JobRun until you reset or pause your Job Bookmark. It is completely safe to execute more Python statements after a job.commit, and as shown in the previous code sample, committing multiple times is also valid.
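If you manage runs programmatically, here is a minimal sketch (assuming boto3 and a hypothetical job name my-glue-job) of enabling bookmarks for a run through the --job-bookmark-option argument and clearing them later with the Glue reset_job_bookmark API:

import boto3

# Hypothetical job name; replace with your own Glue job.
JOB_NAME = 'my-glue-job'

glue = boto3.client('glue')

# Start a run with bookmarks enabled, so job.commit() persists
# the processed paths between runs.
glue.start_job_run(
    JobName=JOB_NAME,
    Arguments={'--job-bookmark-option': 'job-bookmark-enable'})

# Later, to make the job re-read everything from scratch,
# reset the bookmark for that job:
# glue.reset_job_bookmark(JobName=JOB_NAME)

The same --job-bookmark-option argument can instead be set to job-bookmark-pause or job-bookmark-disable, or configured once in the job's default arguments in the console.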
Hope this helps