Every Glue job script should end with job.commit(), but what exactly does this function do?

  1. Is it just a job-end marker, or does it do more than that?
  2. Can it be called twice during one job (and if so, in which cases)?
  3. Is it safe to execute any Python statement after job.commit() is called?

P.S. I have not found any description of it in PyGlue.zip, which contains the AWS Glue Python source code :(

Cherry

3 Answers


As of today, the only case where the Job object is useful is when you use Job Bookmarks. When you read files from Amazon S3 (the only source supported by bookmarks so far) and call job.commit(), a timestamp and the paths read so far are stored internally, so that if for some reason you attempt to read that path again, you only get back the unread (new) files.

In this code sample, I read and process two different paths separately, and commit after each path is processed. If for some reason I stop my job, the same files won't be processed again.

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ['TempDir', 'JOB_NAME'])
sc = SparkContext()
glue_context = GlueContext(sc)
# Init my job
job = Job(glue_context)
job.init(args['JOB_NAME'], args)

paths = [
    's3://bucket-name/my_partition=apples/',
    's3://bucket-name/my_partition=oranges/']
# Read each path individually, operate on it and commit
for path in paths:
    try:
        dynamic_frame = glue_context.create_dynamic_frame_from_options(
            connection_type='s3',
            connection_options={'paths': [path]},
            format='json',
            transformation_ctx="path={}".format(path))
        do_something(dynamic_frame)
        # Commit the files read so far to the Job Bookmark
        job.commit()
    except Exception:
        pass  # Something failed; this path's reads are not committed
Calling the commit method on a Job object only works if you have Job Bookmarks enabled, and the stored references are kept from JobRun to JobRun until you reset or pause your Job Bookmark. It is completely safe to execute more Python statements after a job.commit(), and, as shown in the previous code sample, committing multiple times is also valid.
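If you want to enable, reset, or pause bookmarks from outside the script, that happens at the job-run level rather than in the script itself. A minimal sketch with boto3, assuming a hypothetical job named my-glue-job:

import boto3

glue = boto3.client('glue')

# Start a run with Job Bookmarks explicitly enabled
glue.start_job_run(
    JobName='my-glue-job',  # hypothetical job name
    Arguments={'--job-bookmark-option': 'job-bookmark-enable'})

# Reset the bookmark so the next run re-reads all paths from scratch
glue.reset_job_bookmark(JobName='my-glue-job')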

Hope this helps

hoaxz
  • I can confirm. I am reading from another db and table and, with job bookmark enabled, the job fails on subsequent runs. This is how I came to this Stack Overflow question. Does the bookmark only track which partitions have been read in a hive-formatted path (for example `/my_partition=apples/`), or does it keep track of which folders it has read inside the partition folder as well? – stewart99 Mar 05 '18 at 21:29
  • @doorfly technically all files are inside the bucket at the same level (prefixes are used to index files, but the concept of folders doesn't exist within S3). With that being said, bookmarks will read any new files (doesn't matter which prefix they have) based on the timestamp of the file. – hoaxz Mar 06 '18 at 01:55
  • yes I know s3 doesn't have "folders"; it was for brevity. That said, I can't seem to get job bookmarking to work. There doesn't seem to be a way to get the bookmark position. There is a reset-job-bookmark in the API, but not something like `get-job-bookmark` which would help with debugging. – stewart99 Mar 06 '18 at 23:05
  • @doorfly, I'd love to dig deeper into your scenario. Can you show me a code sample of how you're reading your data from the S3 bucket? – hoaxz Mar 07 '18 at 00:28
  • here is the snippet: `years = [2017, 2018] months = range(1,13) days = range(1,32) glue0 = glueContext.create_dynamic_frame.from_options(connection_type='s3', connection_options={'paths': ['s3://dev-bucket/aws-glue/data/{}/{:02d}/{:02d}'.format(y, m, d) for y in years for m in months for d in days]}, format='json')` Not all paths are actually there since I enumerated over all months and days in 2017 and 2018. The api seem to ignore those that are not found and the job finishes executing, just not with any sort of bookmarking behavior that I could see. @hoaxz – stewart99 Mar 08 '18 at 05:03
  • there is something wrong with your code sample. In the call `glue0 = glueContext.create_dynamic_frame.from_options(connection_type='s3', ...)` the parameter `transformation_ctx="some context here"` must be added so the job bookmark feature works. I feel like the API should have thrown an error if the `transformation_ctx` was not provided, or provided a default one. AFAIK the value of that parameter is just a string and can be any value. @hoaxz – stewart99 Mar 08 '18 at 15:18
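To illustrate the fix from that last comment: the snippet works with bookmarks once a transformation_ctx is supplied. A sketch, keeping the hypothetical dev-bucket layout from the comment:

years = [2017, 2018]
months = range(1, 13)
days = range(1, 32)
paths = ['s3://dev-bucket/aws-glue/data/{}/{:02d}/{:02d}'.format(y, m, d)
         for y in years for m in months for d in days]
glue0 = glueContext.create_dynamic_frame.from_options(
    connection_type='s3',
    connection_options={'paths': paths},
    format='json',
    transformation_ctx='read_dev_bucket_json')  # any stable string; required for bookmarks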

To expand on @yspotts' answer: it is possible to execute more than one job.commit() in an AWS Glue job script, although, as they noted, the bookmark will be updated only once. However, it is also safe to call job.init() more than once. In that case, the bookmarks are updated correctly with the S3 files processed since the previous commit.

In the init() function, there is an "initialised" flag that gets set to true. In the commit() function this flag is checked: if true, commit() performs the steps to commit the bookmark and resets the flag; if false, it does nothing.
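In other words, the behaviour is roughly equivalent to the following sketch (illustrative only, not the actual Glue source):

class Job:
    def __init__(self, glue_context):
        self._initialised = False

    def init(self, job_name, args):
        # ... set up the job run ...
        self._initialised = True

    def commit(self):
        if self._initialised:
            # ... persist bookmark state for everything read so far ...
            self._initialised = False  # stays false until the next init()
        # otherwise: do nothing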

So, the only thing to change from @hoaxz's answer is to call job.init() in every iteration of the for loop:

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ['TempDir', 'JOB_NAME'])
sc = SparkContext()
glue_context = GlueContext(sc)
# Init my job
job = Job(glue_context)

paths = [
    's3://bucket-name/my_partition=apples/',
    's3://bucket-name/my_partition=oranges/']
# Read each path individually, operate on it and commit
for s3_path in paths:
    # Re-initialise the job so the next commit updates the bookmark again
    job.init(args['JOB_NAME'], args)
    dynamic_frame = glue_context.create_dynamic_frame_from_options(
        connection_type='s3',
        connection_options={'paths': [s3_path]},
        format='json',
        transformation_ctx="path={}".format(s3_path))
    do_something(dynamic_frame)
    # Commit the files read so far to the Job Bookmark
    job.commit()
nanodgb

According to the AWS support team, commit should not be called more than once. Here is the exact response I got from them:

The method job.commit() can be called multiple times and it will not throw an error. However, if job.commit() is called multiple times in a Glue script, the job bookmark will be updated only once per job run, after the first job.commit() call; the other job.commit() calls will be ignored by the bookmark. Hence, the job bookmark may get stuck in a loop and not work well with multiple job.commit() calls. Thus, I would recommend you use job.commit() once in the Glue script.
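Following that recommendation, the loop from the earlier answers can be collapsed into a single read and a single commit. A sketch, reusing the hypothetical paths list and do_something() from the other answers:

# Read all partitions in one DynamicFrame and commit once
dynamic_frame = glue_context.create_dynamic_frame_from_options(
    connection_type='s3',
    connection_options={'paths': paths},
    format='json',
    transformation_ctx='read_all_partitions')
do_something(dynamic_frame)
# Single bookmark update for the whole run
job.commit()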

yspotts