1

I think I have a similar question to the one asked in this thread, but I will try to be more specific...

What is the best way to periodically process data using AWS? For example, I want to process reports that I aggregate into S3 once per minute. Is the best approach to have a script add a step to an existing job flow every minute?

Community

3 Answers

2

For now, I will write a script that:

  • Gets the job flow details from AWS
  • If the job flow is in the WAITING state, adds a new step to it
  • Handles the 256-step limit. Since I am using the AWS PHP SDK (AmazonEMR), I will add some code for this: for example, create a new job flow with the same parameters and terminate the existing one once it has more than 200 steps.
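The steps above can be sketched roughly as follows. This is a minimal illustration in Python rather than the PHP the answer mentions, and it is not the author's actual script. It assumes `emr` is a boto3 EMR client (`boto3.client("emr")`) and uses its `describe_cluster`, `list_steps`, `add_job_flow_steps`, and `terminate_job_flows` calls. The names `count_steps`, `submit_step`, `rotate_after`, and `start_replacement` are hypothetical; `start_replacement` stands for whatever code launches a new job flow with the same parameters and returns its ID.

```python
def count_steps(emr, cluster_id):
    """Count every step on the cluster, following list_steps pagination."""
    total, kwargs = 0, {"ClusterId": cluster_id}
    while True:
        page = emr.list_steps(**kwargs)
        total += len(page["Steps"])
        marker = page.get("Marker")
        if not marker:
            return total
        kwargs["Marker"] = marker


def submit_step(emr, cluster_id, step, start_replacement, rotate_after=200):
    """Add `step` if the cluster is idle; replace the cluster near the step limit.

    Returns (cluster ID to use on the next run, new step ID or None).
    """
    status = emr.describe_cluster(ClusterId=cluster_id)["Cluster"]["Status"]
    if status["State"] != "WAITING":
        # Still busy with an earlier step (or not usable): skip this run.
        return cluster_id, None

    if count_steps(emr, cluster_id) >= rotate_after:
        # Assumption: start_replacement() launches an identical job flow
        # and returns its ID; the old one is terminated afterwards.
        new_id = start_replacement()
        emr.terminate_job_flows(JobFlowIds=[cluster_id])
        cluster_id = new_id

    response = emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return cluster_id, response["StepIds"][0]
```

A scheduler such as cron would call `submit_step` once per minute and store the returned cluster ID for the next run.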

I'll update this thread once I have the code ready, and again once I see how it holds up in production for a few weeks.

  • How did it go in production? I see it's been years. Just asking. :) – siliconsenthil Sep 27 '17 at 09:40
  • @siliconsenthil - I'm sure there are much better solutions these days (like Lambda), but this script has been running in production since 2012. It works well with little maintenance. – Yariv Azatchi Aug 17 '18 at 07:40
1

I would use a bootstrap action to install a cron job on the master node.
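As a rough sketch of what that could look like, the Python helpers below build a crontab entry and a bootstrap action definition in the shape that boto3's `run_job_flow` accepts in its `BootstrapActions` list. The helper names, the action name, and the idea of passing the crontab line as a script argument are all assumptions made for illustration. The script at `script_s3_path` is hypothetical and would have to install the crontab entry itself. Bootstrap actions run on every node, so that script should also check that it is running on the master node.

```python
def cron_line(command, every_minutes=1):
    """Return a crontab entry that runs `command` every N minutes."""
    if not 1 <= every_minutes <= 59:
        raise ValueError("every_minutes must be between 1 and 59")
    minute = "*" if every_minutes == 1 else f"*/{every_minutes}"
    return f"{minute} * * * * {command}"


def cron_bootstrap_action(script_s3_path, command, every_minutes=1):
    """Build one entry for the BootstrapActions list of run_job_flow."""
    return {
        "Name": "install-cron-job",  # hypothetical label
        "ScriptBootstrapAction": {
            # Hypothetical installer script stored in S3.
            "Path": script_s3_path,
            # The crontab line is handed to the script as its only argument.
            "Args": [cron_line(command, every_minutes)],
        },
    }
```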

nkadwa
0

Consider the (new) AWS Lambda service. You upload your script and configure an S3 bucket/folder to monitor. The code runs every time new input is added to the folder, and AWS scales the compute capacity as necessary to keep up with demand.

https://aws.amazon.com/lambda/
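
For illustration, a minimal Python handler for an S3-triggered Lambda function might look like the sketch below. It reads the bucket name and object key from each record of the S3 event notification. `process_report` is a hypothetical placeholder for the real report processing, and the name `handler` is whatever you configure as the function's entry point.

```python
from urllib.parse import unquote_plus


def process_report(bucket, key):
    """Hypothetical placeholder for the real report-processing logic."""
    return f"s3://{bucket}/{key}"


def handler(event, context):
    """Process every object named in an S3 event notification."""
    processed = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 events, so decode them first.
        key = unquote_plus(record["s3"]["object"]["key"])
        processed.append(process_report(bucket, key))
    return processed
```

Note that this runs when new objects arrive rather than on a fixed one-minute schedule, so it suits the case where each new report should be processed as soon as it lands in S3.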

jmilloy