0

When you create a new AWS EMR cluster through the AWS Management Console you're able to provide JSON Software Configurations. You can put the JSON file in an S3 bucket and point the Software Configurations to the S3 bucket via the following field,

enter image description here

I need to do this through the AWS Python SDK Boto3 library but I don't see where to do it at in the available fields in their example,

response = client.run_job_flow(
    Name='string',
    LogUri='string',
    AdditionalInfo='string',
    AmiVersion='string',
    ReleaseLabel='string',
    Instances={
        'MasterInstanceType': 'string',
        'SlaveInstanceType': 'string',
        'InstanceCount': 123,
        'InstanceGroups': [
            {
                'Name': 'string',
                'Market': 'ON_DEMAND'|'SPOT',
                'InstanceRole': 'MASTER'|'CORE'|'TASK',
                'BidPrice': 'string',
                'InstanceType': 'string',
                'InstanceCount': 123,
                'Configurations': [
                    {
                        'Classification': 'string',
                        'Configurations': {'... recursive ...'},
                        'Properties': {
                            'string': 'string'
                        }
                    },
                ],
                'EbsConfiguration': {
                    'EbsBlockDeviceConfigs': [
                        {
                            'VolumeSpecification': {
                                'VolumeType': 'string',
                                'Iops': 123,
                                'SizeInGB': 123
                            },
                            'VolumesPerInstance': 123
                        },
                    ],
                    'EbsOptimized': True|False
                },
                'AutoScalingPolicy': {
                    'Constraints': {
                        'MinCapacity': 123,
                        'MaxCapacity': 123
                    },
                    'Rules': [
                        {
                            'Name': 'string',
                            'Description': 'string',
                            'Action': {
                                'Market': 'ON_DEMAND'|'SPOT',
                                'SimpleScalingPolicyConfiguration': {
                                    'AdjustmentType': 'CHANGE_IN_CAPACITY'|'PERCENT_CHANGE_IN_CAPACITY'|'EXACT_CAPACITY',
                                    'ScalingAdjustment': 123,
                                    'CoolDown': 123
                                }
                            },
                            'Trigger': {
                                'CloudWatchAlarmDefinition': {
                                    'ComparisonOperator': 'GREATER_THAN_OR_EQUAL'|'GREATER_THAN'|'LESS_THAN'|'LESS_THAN_OR_EQUAL',
                                    'EvaluationPeriods': 123,
                                    'MetricName': 'string',
                                    'Namespace': 'string',
                                    'Period': 123,
                                    'Statistic': 'SAMPLE_COUNT'|'AVERAGE'|'SUM'|'MINIMUM'|'MAXIMUM',
                                    'Threshold': 123.0,
                                    'Unit': 'NONE'|'SECONDS'|'MICRO_SECONDS'|'MILLI_SECONDS'|'BYTES'|'KILO_BYTES'|'MEGA_BYTES'|'GIGA_BYTES'|'TERA_BYTES'|'BITS'|'KILO_BITS'|'MEGA_BITS'|'GIGA_BITS'|'TERA_BITS'|'PERCENT'|'COUNT'|'BYTES_PER_SECOND'|'KILO_BYTES_PER_SECOND'|'MEGA_BYTES_PER_SECOND'|'GIGA_BYTES_PER_SECOND'|'TERA_BYTES_PER_SECOND'|'BITS_PER_SECOND'|'KILO_BITS_PER_SECOND'|'MEGA_BITS_PER_SECOND'|'GIGA_BITS_PER_SECOND'|'TERA_BITS_PER_SECOND'|'COUNT_PER_SECOND',
                                    'Dimensions': [
                                        {
                                            'Key': 'string',
                                            'Value': 'string'
                                        },
                                    ]
                                }
                            }
                        },
                    ]
                }
            },
        ],
        'InstanceFleets': [
            {
                'Name': 'string',
                'InstanceFleetType': 'MASTER'|'CORE'|'TASK',
                'TargetOnDemandCapacity': 123,
                'TargetSpotCapacity': 123,
                'InstanceTypeConfigs': [
                    {
                        'InstanceType': 'string',
                        'WeightedCapacity': 123,
                        'BidPrice': 'string',
                        'BidPriceAsPercentageOfOnDemandPrice': 123.0,
                        'EbsConfiguration': {
                            'EbsBlockDeviceConfigs': [
                                {
                                    'VolumeSpecification': {
                                        'VolumeType': 'string',
                                        'Iops': 123,
                                        'SizeInGB': 123
                                    },
                                    'VolumesPerInstance': 123
                                },
                            ],
                            'EbsOptimized': True|False
                        },
                        'Configurations': [
                            {
                                'Classification': 'string',
                                'Configurations': {'... recursive ...'},
                                'Properties': {
                                    'string': 'string'
                                }
                            },
                        ]
                    },
                ],
                'LaunchSpecifications': {
                    'SpotSpecification': {
                        'TimeoutDurationMinutes': 123,
                        'TimeoutAction': 'SWITCH_TO_ON_DEMAND'|'TERMINATE_CLUSTER',
                        'BlockDurationMinutes': 123
                    }
                }
            },
        ],
        'Ec2KeyName': 'string',
        'Placement': {
            'AvailabilityZone': 'string',
            'AvailabilityZones': [
                'string',
            ]
        },
        'KeepJobFlowAliveWhenNoSteps': True|False,
        'TerminationProtected': True|False,
        'HadoopVersion': 'string',
        'Ec2SubnetId': 'string',
        'Ec2SubnetIds': [
            'string',
        ],
        'EmrManagedMasterSecurityGroup': 'string',
        'EmrManagedSlaveSecurityGroup': 'string',
        'ServiceAccessSecurityGroup': 'string',
        'AdditionalMasterSecurityGroups': [
            'string',
        ],
        'AdditionalSlaveSecurityGroups': [
            'string',
        ]
    },
    Steps=[
        {
            'Name': 'string',
            'ActionOnFailure': 'TERMINATE_JOB_FLOW'|'TERMINATE_CLUSTER'|'CANCEL_AND_WAIT'|'CONTINUE',
            'HadoopJarStep': {
                'Properties': [
                    {
                        'Key': 'string',
                        'Value': 'string'
                    },
                ],
                'Jar': 'string',
                'MainClass': 'string',
                'Args': [
                    'string',
                ]
            }
        },
    ],
    BootstrapActions=[
        {
            'Name': 'string',
            'ScriptBootstrapAction': {
                'Path': 'string',
                'Args': [
                    'string',
                ]
            }
        },
    ],
    SupportedProducts=[
        'string',
    ],
    NewSupportedProducts=[
        {
            'Name': 'string',
            'Args': [
                'string',
            ]
        },
    ],
    Applications=[
        {
            'Name': 'string',
            'Version': 'string',
            'Args': [
                'string',
            ],
            'AdditionalInfo': {
                'string': 'string'
            }
        },
    ],
    Configurations=[
        {
            'Classification': 'string',
            'Configurations': {'... recursive ...'},
            'Properties': {
                'string': 'string'
            }
        },
    ],
    VisibleToAllUsers=True|False,
    JobFlowRole='string',
    ServiceRole='string',
    Tags=[
        {
            'Key': 'string',
            'Value': 'string'
        },
    ],
    SecurityConfiguration='string',
    AutoScalingRole='string',
    ScaleDownBehavior='TERMINATE_AT_INSTANCE_HOUR'|'TERMINATE_AT_TASK_COMPLETION',
    CustomAmiId='string',
    EbsRootVolumeSize=123,
    RepoUpgradeOnBoot='SECURITY'|'NONE',
    KerberosAttributes={
        'Realm': 'string',
        'KdcAdminPassword': 'string',
        'CrossRealmTrustPrincipalPassword': 'string',
        'ADDomainJoinUser': 'string',
        'ADDomainJoinPassword': 'string'
    }
)

How can I provide an S3 bucket location that has the Software Configuration JSON file for creating an EMR cluster through the Boto3 library?

Kyle Bridenstine
  • 6,055
  • 11
  • 62
  • 100

2 Answers2

1

The Configuring Applications - Amazon EMR documentation says:

Supplying a Configuration in the Console

To supply a configuration, you navigate to the Create cluster page and choose Edit software settings. You can then enter the configuration directly (in JSON or using shorthand syntax demonstrated in shadow text) in the console or provide an Amazon S3 URI for a file with a JSON Configurations object.

That seems to be the capability you showed in your question.

The documentation then shows how you can do it via the CLI:

aws emr create-cluster --use-default-roles --release-label emr-5.14.0 --instance-type m4.large --instance-count 2 --applications Name=Hive --configurations https://s3.amazonaws.com/mybucket/myfolder/myConfig.json

This maps to the Configurations options in the JSON you show above:

                    'Configurations': [
                        {
                            'Classification': 'string',
                            'Configurations': {'... recursive ...'},
                            'Properties': {
                                'string': 'string'
                            }
                        },
                    ]

Configurations: A configuration classification that applies when provisioning cluster instances, which can include configurations for applications and software that run on the cluster.

It would contain settings such as:

[
  {
    "Classification": "core-site",
    "Properties": {
      "hadoop.security.groups.cache.secs": "250"
    }
  },
  {
    "Classification": "mapred-site",
    "Properties": {
      "mapred.tasktracker.map.tasks.maximum": "2",
      "mapreduce.map.sort.spill.percent": "0.90",
      "mapreduce.tasktracker.reduce.tasks.maximum": "5"
    }
  }
]

Short answer: Configurations

John Rotenstein
  • 241,921
  • 22
  • 380
  • 470
1

Right now the boto3 SDK can't directly import the configuration settings from s3 for you as part of the run_job_flow() function. You would need to setup an S3 client in boto3, download the data as an S3 object and then update the Configuration List part of your EMR dictionary with the JSON data in your S3 file.

An example of how to download a json file from S3 and then load it into memory as a Python Dict can be found over here - Reading an JSON file from S3 using Python boto3

huma474
  • 111
  • 3
  • I like this answer. Just to clarify though, I take this JSON and put it in under `response = client.run_job_flow(...,Configurations=theS3JsonThatWeRetrieved,...)` right? – Kyle Bridenstine Jun 22 '18 at 18:08
  • 1
    Based on how the dictionary for the run_job_flow is built you would need to add those configs to each InstanceGroups element and InstanceFleets element you define, so it would be something like run_job_flow(...,Instances={ 'InstanceGroups':[ { ..., 'Configurations' : theS3JsonThatWeRetrieved,... }],...'InstanceFleets':[{..., 'Configurations' : theS3JsonThatWeRetrieved,...} ]}). Both InstanceGroups and InstanceFleets are optional values to the run_job_flow function, so you'd only need to define the specific values you need – huma474 Jun 23 '18 at 18:25
  • huma474 I'm a little confused what I have to fill out in the InstanceGroups and InstanceFleets because I currently don't have those fields set. Mine currently looks like this `Instances={'MasterInstanceType': 'r4.xlarge','SlaveInstanceType': 'r4.xlarge','InstanceCount': 4,'KeepJobFlowAliveWhenNoSteps': False,'TerminationProtected': False}`. I have the JSON configuration file but I'm not sure if I only have to set the configuration field in the Groups/Fleets or if I have to fill out the other fields? Would you mind showing an example please? – Kyle Bridenstine Jul 11 '18 at 19:17
  • Note to others. It turns out that you DO NOT have to fill out the configuration inside the InstanceGroups or InstanceFleets you can simply fill it out in the configurations field refer here for a reference http://boto3.readthedocs.io/en/latest/reference/services/emr.html#EMR.Client.run_job_flow – Kyle Bridenstine Jul 12 '18 at 16:22