I am testing jobs on EMR, and every test takes a long time to start up. Is there a way to keep the server/master node alive in Amazon EMR? I know this can be done with the API, but I wanted to know if it can also be done in the AWS console?
Check the top answer to the same question here: http://stackoverflow.com/questions/6880283/re-use-amazon-elastic-mapreduce-instance – Matthew Rathbone Aug 31 '11 at 16:28
3 Answers
You cannot do this from the AWS console. To quote the developer guide:
The Amazon Elastic MapReduce tab in the AWS Management Console does not support adding steps to a job flow.
You can only do this via the CLI or API, by creating a job flow and then adding steps to it:
$ ./elastic-mapreduce --create --alive --stream
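Once the job flow is up, you add steps to it from the same CLI. A sketch of adding a streaming step (the job-flow ID, mapper, reducer, and S3 paths below are placeholders, not values from the question):

$ ./elastic-mapreduce --jobflow j-YOURJOBFLOWID \
    --stream \
    --mapper s3://mybucket/mapper.py \
    --reducer s3://mybucket/reducer.py \
    --input s3://mybucket/input/ \
    --output s3://mybucket/output/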

In fact, you don't even need to pass --stream; just --create --alive will do it. – Matthew Rathbone Aug 10 '11 at 19:24
You can't do this with the web console, but through the API and programming tools you can add multiple steps to a long-running job flow, which is what I do. That way you can fire off jobs one after the other on the same long-running cluster, without having to create a new one each time.
If you are familiar with Python, I highly recommend the Boto library. The other AWS API tools let you do this as well.
If you follow the Boto EMR tutorial, you'll find some examples.
Just to give you an idea, this is what I do (with streaming jobs):
import sys
import time

import boto
from boto.emr.step import StreamingStep

# Connect to EMR
conn = boto.connect_emr()

# Start a long-running job flow; keep_alive=True is what keeps the
# cluster up after each step finishes
jobid = conn.run_jobflow(name='My jobflow',
                         log_uri='s3://<my log uri>/jobflow_logs',
                         keep_alive=True)

# Create your streaming job
step = StreamingStep(...)

# Add the step to the job flow
conn.add_jobflow_steps(jobid, [step])

# Wait till it's complete
while True:
    state = conn.describe_jobflow(jobid).steps[-1].state
    if state == "COMPLETED":
        break
    if state in ("FAILED", "TERMINATED", "CANCELLED"):
        print >> sys.stderr, "EMR job failed! State = %s" % state
        sys.exit(1)
    time.sleep(60)

# Create your next job here and add it to the EMR cluster
step = StreamingStep(...)
conn.add_jobflow_steps(jobid, [step])

# Repeat :)
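One thing to watch: with keep_alive=True the cluster (and the billing clock) keeps running after the last step, so shut it down explicitly when you're done. boto exposes this as terminate_jobflow:

# Shut the long-running cluster down once everything is finished
conn.terminate_jobflow(jobid)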

To keep the machine alive, start an interactive Pig session; then the machine won't shut down. You can then exercise your map/reduce logic from the command line using:
cat infile.txt | yourMapper | sort | yourReducer > outfile.txt
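For reference, the interactive session itself can be requested when you launch the cluster; a sketch, assuming the elastic-mapreduce Ruby client's --pig-interactive flag:

$ ./elastic-mapreduce --create --alive --name "Pig session" --pig-interactive

You then SSH into the master node and run the pipeline above there.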

You'd have to SSH into the master, and this command chain doesn't run in Hadoop, so there's no parallelism to be gained from it. – Ronen Botzer Feb 25 '11 at 22:30