Periodically process data using AWS EMR

Question

I think I have a similar question to the one asked in this thread, but I will try to be more specific...

What is the best way to periodically process data using AWS? For example, I want to process reports that I aggregate into S3 once per minute. Is the best approach to have a script add a step to an existing job flow every minute?

score 2 · Answer 1 · answered May 22 '12 at 09:04


Well, for now I will write a script that:

Gets the job flow details from AWS.
If the job flow is in the WAITING state, adds a new step to it.
Since I am using the AWS PHP AmazonEMR class, I will add some code to handle the limit of 256 steps per job flow: for example, once the job flow has more than 200 steps, create a new job flow with the same parameters and terminate the existing one.
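
As a rough illustration of the script described above, here is a minimal sketch in Python rather than PHP. It assumes a boto3-style EMR client is passed in (for example `boto3.client("emr")`), and the function names, the returned action strings, and the S3 path in `EXAMPLE_STEP` are hypothetical. Recreating the job flow "with the same parameters" is left to the caller, since those parameters are not given in this thread.

```python
# Hypothetical step definition; the script path is a placeholder, not a real location.
EXAMPLE_STEP = {
    "Name": "process-reports",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": ["bash", "-c", "echo 'process s3://example-bucket/reports/'"],
    },
}


def count_steps(emr, cluster_id):
    """Count all steps on the job flow, following ListSteps pagination."""
    total, marker = 0, None
    while True:
        kwargs = {"ClusterId": cluster_id}
        if marker:
            kwargs["Marker"] = marker
        page = emr.list_steps(**kwargs)
        total += len(page.get("Steps", []))
        marker = page.get("Marker")
        if not marker:
            return total


def submit_if_waiting(emr, cluster_id, step, max_steps=200):
    """Add `step` only if the job flow is idle and below the step threshold.

    Returns "skipped" (job flow busy), "recycle" (threshold reached, the
    caller should start a replacement job flow), or "submitted".
    """
    status = emr.describe_cluster(ClusterId=cluster_id)["Cluster"]["Status"]
    if status["State"] != "WAITING":
        return "skipped"
    if count_steps(emr, cluster_id) >= max_steps:
        return "recycle"
    emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return "submitted"
```

Run from cron once per minute, a call such as `submit_if_waiting(emr, cluster_id, EXAMPLE_STEP)` would either queue the step or report why it did not. On "recycle", the caller would create the replacement job flow and terminate the old one.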

I'll update this thread once I have the code ready, and again later once I see how it holds up in production for a few weeks.


Yariv Azatchi


How did it go in production? I see it's been years. Just asking. :) – siliconsenthil Sep 27 '17 at 09:40
@siliconsenthil - I'm sure these days there are much better solutions (like Lambda), but this script has been running in production since 2012. It works well with little maintenance. – Yariv Azatchi Aug 17 '18 at 07:40

score 1 · Answer 2 · answered Jun 13 '12 at 14:43


I would use a bootstrap action to install a cron job on the master node.


nkadwa


score 0 · Answer 3 · answered Nov 13 '14 at 22:20

Consider the (new) AWS Lambda service. You upload your script and configure an S3 bucket/folder to monitor. The code runs every time new input is added to the folder, and AWS provisions compute capacity as necessary to keep up with demand.

https://aws.amazon.com/lambda/
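
To make the idea concrete, here is a minimal sketch of a Python handler for S3 event notifications. The names `extract_objects`, `process_report`, and `handler` are hypothetical, and `process_report` is only a placeholder for the real report processing. Object keys arrive URL-encoded in the event, so they are decoded before use.

```python
from urllib.parse import unquote_plus


def extract_objects(event):
    """Return (bucket, key) pairs from an S3 event notification payload."""
    pairs = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        # Keys are URL-encoded in the event ("+" for space, "%3A" for ":").
        pairs.append((s3["bucket"]["name"], unquote_plus(s3["object"]["key"])))
    return pairs


def process_report(bucket, key):
    """Placeholder: the real report processing would go here."""
    return f"s3://{bucket}/{key}"


def handler(event, context):
    """Entry point invoked once per S3 notification."""
    return [process_report(bucket, key) for bucket, key in extract_objects(event)]
```

With this approach there is no once-per-minute polling: each new report triggers its own invocation, so the 256-step bookkeeping from the first answer is not needed.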
