
I am relatively new to AWS, so this may be a less technical question, but at present AWS Glue notes a maximum of 25 jobs permitted to be created. We are loading a series of tables, each with its own job that subsequently appends audit columns. The jobs are nearly identical; only the source and target connection strings change.

Is there a way to parameterize these jobs to allow for reuse, simply passing the proper connection strings to each? Or could a master job loop through a set of connection strings and call a child job, passing the varying connection strings through?

Any examples or documentation would be most appreciated.

Sauron
  • AWS Support will often lift their service maximums (in this case 25) on request. You might try that first. – RobinL Sep 13 '18 at 15:16
  • @RobinL But is there a more efficient way to code what we are trying? Any examples would be great – Sauron Sep 13 '18 at 15:17

1 Answer


The example below shows how to use Glue job input parameters in code. The script reads the input parameters and writes them to a flat file.

  1. Setting the input parameters in the job configuration.

(Screenshot: input parameters defined in the Glue job configuration)

  2. The Glue job code (a variant parameterized for the question's table loads is sketched after this list):
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
 
## @params: [JOB_NAME, VAL1, VAL2, VAL3, DEST_FOLDER]
# Resolve all expected job parameters in a single call
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'VAL1', 'VAL2', 'VAL3', 'DEST_FOLDER'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Collect the parameter values into a single-row list of dictionaries
v_list = [{"VAL1": args['VAL1'], "VAL2": args['VAL2'], "VAL3": args['VAL3']}]

# Build a DataFrame and write it out as a single semicolon-delimited CSV file
df = sc.parallelize(v_list).toDF()
df.repartition(1).write.mode('overwrite').format('csv').options(header=True, delimiter=';').save("s3://" + args['DEST_FOLDER'] + "/")

job.commit()
  3. It is also possible to provide input parameters when using boto3, CloudFormation, or Step Functions. This example shows how to do it using boto3; a loop over multiple tables is sketched after the list.
import boto3

def lambda_handler(event, context):
    glue = boto3.client('glue')

    # Create the job; Glue expects custom argument keys to carry the '--'
    # prefix so that getResolvedOptions can parse them in the job script
    myJob = glue.create_job(Name='example_job2', Role='AWSGlueServiceDefaultRole',
                            Command={'Name': 'glueetl', 'ScriptLocation': 's3://aws-glue-scripts/example_job'},
                            DefaultArguments={'--VAL1': 'value1', '--VAL2': 'value2', '--VAL3': 'value3'})

    # Start a run, overriding the default arguments for this execution
    glue.start_job_run(JobName=myJob['Name'],
                       Arguments={'--VAL1': 'value11', '--VAL2': 'value22', '--VAL3': 'value33'})
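
A minimal sketch of how the same pattern could drive the table loads from the question; the parameter names (SOURCE_DATABASE, SOURCE_TABLE, DEST_FOLDER) and the catalog-based read are illustrative assumptions, not part of the original answer:

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from pyspark.context import SparkContext

# Hypothetical parameters: which table to read and where to write it
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'SOURCE_DATABASE', 'SOURCE_TABLE', 'DEST_FOLDER'])

glueContext = GlueContext(SparkContext.getOrCreate())

# Read the source table from the Glue Data Catalog
dyf = glueContext.create_dynamic_frame.from_catalog(
    database=args['SOURCE_DATABASE'],
    table_name=args['SOURCE_TABLE'])

# ... append audit columns here ...

# Write to the parameterized target location
glueContext.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type='s3',
    connection_options={'path': 's3://' + args['DEST_FOLDER'] + '/' + args['SOURCE_TABLE']},
    format='parquet')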

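Building on that, a minimal sketch of the "master" loop the question asks about: one generic job started once per table, with different arguments per run. The table list and parameter names here are hypothetical:

import boto3

glue = boto3.client('glue')

# Hypothetical per-table settings; the job definition itself stays generic
table_configs = [
    {'--SOURCE_DATABASE': 'db1', '--SOURCE_TABLE': 'orders', '--DEST_FOLDER': 'my-bucket/orders'},
    {'--SOURCE_DATABASE': 'db1', '--SOURCE_TABLE': 'customers', '--DEST_FOLDER': 'my-bucket/customers'},
]

# Start one run of the same job per table, overriding the default arguments
for cfg in table_configs:
    run = glue.start_job_run(JobName='example_job2', Arguments=cfg)
    print(run['JobRunId'])
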
Useful links:

  1. https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-get-resolved-options.html
  2. https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-calling.html
  3. https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue.html#Glue.Client.create_job
  4. https://docs.aws.amazon.com/step-functions/latest/dg/connectors-glue.html
jbgorski
  • The job parameters section is not available in my "Edit job". Maybe it's not available for all job types? – Jari Turkia May 11 '20 at 10:53
  • I found the parameters. They were well hidden in the job definition. – Jari Turkia May 12 '20 at 06:32
  • Hi just wondering if `crawler` is required or optional for creating a job? – wawawa Feb 07 '21 at 15:38
  • Thanks! This is a very straightforward answer! – Joabe Lucena Apr 01 '21 at 14:52
  • @Cecilia It's not necessary to create a crawler for a Glue job. – mohit sharma Sep 30 '21 at 12:14
  • Ironically, the job parameters (which you may wish to change run to run) are visible in the job GUI, but to edit them, you have to start running the job and apparently it saves the last parameters used as the default. This is really terrible, because presumably you would want to run with different parameters at different times or even maintain different params for the same job in different workflows, which seems impossible (so much for DRY). Even in their happy path where only one workflow uses a job and params never change, it's hard to find them to edit. AWS really messed this up, AFAICT. :( – combinatorist Dec 06 '21 at 15:27
  • OK, thanks, @JariTurkia! With your encouragement, I was able to find the job params hidden in "Action" (top of page) > "Edit Job" (bottom of pop-up) > "Security configuration, script libraries, and job parameters (optional)" (bottom of pop-up again). So, at least you can change them statically without running, but you still can only set one version of the params per job, which prevents reuse in different workflows or different environments (at least via the GUI). – combinatorist Dec 06 '21 at 15:35