
I'm running a Python script using Boto3 (my first time using boto/Boto3) on my local server, which monitors an S3 bucket for new files. When it detects new files in the bucket, it starts a stopped EC2 instance, which has the software loaded onto it to process those files, and then it needs to somehow instruct S3/EC2 to copy the new files from S3 to the EC2 instance. How can I achieve that with a Boto3 script running on my local server?

Essentially, the script running locally is the orchestrator of the process: it needs to start the instance when there are new files to process, have them processed on the EC2 instance, and copy the processed files back to S3. I'm currently stuck trying to figure out how the locally running script can get the files copied from S3 to EC2. I'd like to avoid downloading from S3 to the local server and then uploading to EC2.
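For reference, the part I have so far looks roughly like this (a minimal sketch; the bucket name and instance ID are placeholders, and the "already seen" bookkeeping is simplified):

```python
import boto3

BUCKET = "my-input-bucket"              # placeholder
INSTANCE_ID = "i-0123456789abcdef0"     # placeholder

s3 = boto3.client("s3")
ec2 = boto3.client("ec2")

def find_new_keys(seen_keys):
    """Return object keys in the bucket that haven't been processed yet."""
    new_keys = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET):
        for obj in page.get("Contents", []):
            if obj["Key"] not in seen_keys:
                new_keys.append(obj["Key"])
    return new_keys

def start_processing_instance():
    """Start the stopped instance and wait until it is running."""
    ec2.start_instances(InstanceIds=[INSTANCE_ID])
    ec2.get_waiter("instance_running").wait(InstanceIds=[INSTANCE_ID])
```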

Suggestions/ideas?

bioinformant

2 Answers


You should consider using Lambda for any S3 event-based processing. Why launch and run servers when you don't have to?
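For example, a minimal handler subscribed to the bucket's ObjectCreated notifications could kick off your processing directly; the instance ID below is a placeholder, and starting an already-running instance is harmless:

```python
import boto3

ec2 = boto3.client("ec2")
INSTANCE_ID = "i-0123456789abcdef0"  # placeholder

def lambda_handler(event, context):
    # S3 invokes this with one or more ObjectCreated records.
    keys = [record["s3"]["object"]["key"] for record in event["Records"]]
    print("New objects:", keys)
    # Wake up the stopped processing instance.
    ec2.start_instances(InstanceIds=[INSTANCE_ID])
```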

jarmod
  • Well, the 'processing' I refer to is actually quite a complex pipeline performing bioinformatics analysis ... and it requires an elaborate setup. I was merely using the uploads to the S3 bucket as a trigger to start/stop a preconfigured instance. Though I do see a use for Lambda for some parts ... the parts that rely on events to trigger various steps. Do you know if Lambda has been used for bioinformatics pipelines? – bioinformant Jul 18 '15 at 09:24
  • @bioinformant I would consider using Lambda as the initiator of your pipeline, rather than polling S3. Then I would consider using Simple Workflow (or possibly Data Pipeline) for the pipeline itself. Orchestrating a complex pipeline (sequencing, coordination, handling failures, out-of-band processes and so on) cries out for a reliable workflow mechanism. – jarmod Jul 18 '15 at 12:37

If the name of the bucket and the other parameters don't change, you can achieve this simply by having a script on your EC2 instance that pulls the latest content from the bucket, and setting that script to be triggered every time the instance starts up.
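As a rough sketch (the bucket name and destination directory are placeholders), the pull script on the instance could look like this, triggered at startup via a cron @reboot entry or a systemd unit:

```python
import os
import boto3

BUCKET = "my-input-bucket"    # placeholder
DEST_DIR = "/data/incoming"   # placeholder

s3 = boto3.client("s3")

def pull_bucket():
    """Download every object from the bucket that isn't already present locally."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET):
        for obj in page.get("Contents", []):
            local_path = os.path.join(DEST_DIR, obj["Key"])
            if not os.path.exists(local_path):
                os.makedirs(os.path.dirname(local_path), exist_ok=True)
                s3.download_file(BUCKET, obj["Key"], local_path)

if __name__ == "__main__":
    pull_bucket()
```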

If the S3 parameters do change and you must drive it from your local machine with boto, you'll need a way to SSH into the EC2 instance from your script. Check this module: boto.manage.cmdshell, and a similar question: Boto Execute shell command on ec2 instance
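Note that boto.manage.cmdshell belongs to the older boto 2 package rather than Boto3. As an alternative sketch, here is the same idea with the paramiko SSH library; the instance ID, key file, remote user, and remote script path are all placeholders, and the new object keys are passed as command-line arguments so the instance knows which files to fetch:

```python
import boto3
import paramiko  # assumed SSH library; boto.manage.cmdshell works along the same lines

INSTANCE_ID = "i-0123456789abcdef0"            # placeholder
KEY_FILE = "/path/to/key.pem"                  # placeholder
new_keys = ["sample1.fastq", "sample2.fastq"]  # keys detected by the local script

# Look up the instance's public DNS name with boto3.
instance = boto3.resource("ec2").Instance(INSTANCE_ID)
host = instance.public_dns_name

# SSH in and run the (hypothetical) pull-and-process script, passing the new keys.
ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect(host, username="ec2-user", key_filename=KEY_FILE)
stdin, stdout, stderr = ssh.exec_command(
    "python /opt/pipeline/pull_and_process.py " + " ".join(new_keys)
)
print(stdout.read().decode())
ssh.close()
```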

semirami
  • The name of the bucket will stay constant ... however, the list of files (i.e. the new files that are uploaded to S3 and need to be processed on the EC2 instance) will change, so I need that list of files passed to EC2 somehow, for it to download and then process them. Thanks for the cmdshell suggestion ... will look into that. – bioinformant Jul 16 '15 at 20:37