I am new to AWS and want to run a Python work script that is embarrassingly parallel on an EC2 instance (e.g. c4.4xlarge).

I have gone through questions on the topic, but have not found a high-level answer to the steps I need to take. I have AWS credentials and have boto3 installed on my laptop's python 2.

How do I structure a python submission script that:

  1. Connects to S3 where my python work script and dependencies are
  2. Launches an EC2 instance of a desired type
  3. Submits the python work script to be processed by the EC2 instance

In addition, within my python work script, how do I save the results of the work script back to S3?

Finally, how do I ensure that the python version that I access via AWS has all the packages that are needed to successfully run my python work script?

Sorry if the question is too high-level and for any conceptual mistakes. Thank you for any pointers!

Maarölli
  • You have a python script and dependencies on S3 that you want to run on EC2? – Ninad Gaikwad Jun 27 '19 at 17:16
  • Yes, exactly. So far I have used a remote company cluster on which to run the work script using sbatch/SLURM and now I would want to transition to AWS. Anyhow, happy to hear suggestions if there are better approaches. – Maarölli Jun 27 '19 at 19:39
  • Is there any reason you want to store this script on S3 first? Why not launch EC2 instance from console and then upload the script and install dependencies? Lastly, how long do you need work script to execute for? Would it not work on a lambda? – Ninad Gaikwad Jun 28 '19 at 02:03
  • No reason to store script/dependencies on S3 first. Could well do it after launching EC2 instance. I think AWS lambda is not the answer (again do not know much), as the timeout limits are too strict. Processing the python work script when using 16 cpus on a remote cluster takes 15-30hours. – Maarölli Jun 28 '19 at 04:35
  • OK then you can launch the EC2 instance and upload your work script + dependencies on it. You can assign the EC2 a role which will allow it access to s3 to store results there. Would this be a good solution for you? – Ninad Gaikwad Jun 28 '19 at 06:18
  • If you already have a deployment script sitting somewhere, it can be used for EC2 too. You just need to configure the EC2 network to load the dependencies (or you can use an ssh script to copy those dependencies over). – mootmoot Jun 28 '19 at 14:30

1 Answer

To achieve this, I would suggest adding a bit more detail to your current flow:

In the submission script:

  • Upload/Refresh any dependencies on the S3 bucket.
  • Launch an EC2 instance.

In the EC2 instance:

  • Download dependencies.
  • Do work.
  • Upload the results to S3.
  • Terminate instance.

There are two simple ways to run commands on an EC2 instance: SSH, or the user-data attribute. For simplicity, and for your current use case, I would recommend the user-data method (an SSH sketch is included below for reference).
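
For reference, here is a minimal sketch of the SSH route using paramiko. It assumes a Linux instance (a Windows AMI would typically use WinRM or RDP instead); the hostname, key file, user name and remote command are placeholders to replace with your own values.

import paramiko

# Placeholders -- your instance's public DNS, your key pair file, your command
host = 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'
key_file = '/path/to/your-key-pair.pem'

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect(hostname=host, username='ec2-user', key_filename=key_file)

# Run the worker script remotely and read its output
stdin, stdout, stderr = client.exec_command('python /home/ec2-user/instance_worker.py')
print(stdout.read().decode())
client.close()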

First, you need to create an EC2 instance profile with permissions to download from and upload to the S3 bucket (a rough boto3 sketch follows). Then you can launch an EC2 instance, install any Python or pip packages you need, and register it as an AMI.
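
As a rough sketch, creating that instance profile with the boto3 IAM client could look like this. The role and profile names are placeholders, and the AWS-managed AmazonS3FullAccess policy is used here only for brevity -- in practice you would scope the policy to your bucket.

import json

import boto3

iam = boto3.client('iam')

# Trust policy that lets EC2 assume the role
assume_role_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "ec2.amazonaws.com"},
        "Action": "sts:AssumeRole"
    }]
}

iam.create_role(
    RoleName='worker-s3-role',                       # placeholder name
    AssumeRolePolicyDocument=json.dumps(assume_role_policy)
)

# Attach an S3 policy (scope this down to your bucket in practice)
iam.attach_role_policy(
    RoleName='worker-s3-role',
    PolicyArn='arn:aws:iam::aws:policy/AmazonS3FullAccess'
)

# Wrap the role in an instance profile that EC2 can use
iam.create_instance_profile(InstanceProfileName='worker-s3-profile')
iam.add_role_to_instance_profile(
    InstanceProfileName='worker-s3-profile',
    RoleName='worker-s3-role'
)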

Here is some reference code. Note that this code is Python 3, and the user-data and self-termination parts are written for Windows instances.

submission.py:

import boto3

s3_client = boto3.client('s3')
ec2 = boto3.resource('ec2')

# Placeholders -- fill these in for your own setup
bucket_name = 'your-bucket-name'
image_id = 'ami-xxxxxxxxxxxxxxxxx'        # the AMI you registered with Python + packages
your_ec2_type = 'c4.4xlarge'
your_key_name = 'your-key-pair'
instance_profile_name = 'your-instance-profile'
instance_security_group = 'sg-xxxxxxxx'
path_to_instance_worker_dir = 'C:\\worker'              # placeholder path on the instance
path_to_instance_worker_script = 'instance_worker.py'

deps = {
    'remote' : [
        "/path/to/s3-bucket/obj.txt"      # S3 object keys
    ],

    'local' : [
        "/path/to/local-directory/obj.txt"
    ]
}

# Upload/refresh any dependencies on the S3 bucket
for remote, local in zip(deps['remote'], deps['local']):
    s3_client.upload_file(local, bucket_name, remote)

# user-data runs on first boot; the <powershell> tags are for Windows AMIs
user_data = f"""<powershell>
cd {path_to_instance_worker_dir}; python {path_to_instance_worker_script}
</powershell>
"""

# create_instances returns a list of Instance objects
instances = ec2.create_instances(
    MinCount=1,
    MaxCount=1,
    ImageId=image_id,
    InstanceType=your_ec2_type,
    KeyName=your_key_name,
    IamInstanceProfile={
        'Name': instance_profile_name
    },
    SecurityGroupIds=[
        instance_security_group,
    ],
    UserData=user_data
)
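
If the submission script should log the new instance's ID or wait for it to come up, the returned Instance objects can be used, for example:

instances[0].wait_until_running()   # block until the instance reports 'running'
print(instances[0].id)              # instance ID, useful for logging or manual cleanup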

instance_worker.py:

import subprocess

import boto3

s3_client = boto3.client('s3')
ec2_client = boto3.client('ec2')

bucket_name = 'your-bucket-name'   # placeholder -- same bucket as in submission.py

deps = {
    'remote' : [
        "/path/to/s3-bucket/obj.txt"
    ],

    'local' : [
        "/path/to/local-directory/obj.txt"
    ]
}

# Download dependencies from S3
for remote, local in zip(deps['remote'], deps['local']):
    s3_client.download_file(bucket_name, remote, local)

result = do_work()

# write results to file, then upload the file to S3
s3_client.upload_file(result_file, bucket_name, result_remote)

# Get this instance's ID from the metadata service
# (this PowerShell invocation is only for Windows machines)
p = subprocess.Popen(
    ["powershell.exe",
     "(Invoke-WebRequest -Uri 'http://169.254.169.254/latest/meta-data/instance-id').Content"],
    stdout=subprocess.PIPE
)
out = p.communicate()[0]
instance_id = str(out.strip().decode('ascii'))

# Terminate this instance now that the work is done
ec2_client.terminate_instances(InstanceIds=[instance_id, ])

In this code, I terminate the instance from within itself; in order to do that you must first obtain the instance_id from the instance metadata, have a look here for more references.

"Finally, how do I ensure that the python version that I access via AWS has all the packages that are needed to successfully run my python work script?"

In theory, you can use the user data to run any scripts or CLI commands you would like, including installing Python and pip dependencies (a sketch follows below), but if the setup is too complicated or heavy to install on every launch, I would suggest you build an image and launch from it, as mentioned before.
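
For example, a minimal sketch of a user-data block that installs pip dependencies before running the worker (assuming a Windows AMI with Python already on the PATH; the package names and paths are placeholders):

# Placeholder user-data: install dependencies, then run the worker script
user_data = """<powershell>
python -m pip install --upgrade pip
python -m pip install boto3 numpy
cd C:\\worker; python instance_worker.py
</powershell>
"""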

Ben Zikri