I am testing jobs on EMR, and every test takes a long time to start up. Is there a way to keep the server/master node alive in Amazon EMR? I know this can be done with the API, but I wanted to know if it can also be done in the AWS console?
Check the top answer to the same question here: http://stackoverflow.com/questions/6880283/re-use-amazon-elastic-mapreduce-instance – Matthew Rathbone Aug 31 '11 at 16:28
3 Answers
You cannot do this from the AWS console. To quote the developer guide:
The Amazon Elastic MapReduce tab in the AWS Management Console does not support adding steps to a job flow.
You can only do this via the CLI or API, by creating a job flow and then adding steps to it:
$ ./elastic-mapreduce --create --alive --stream
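Once the job flow is up, you add steps to it from the same CLI. A sketch of adding a streaming step (the job-flow ID, mapper, reducer, and S3 paths below are placeholders, not values from the question):

$ ./elastic-mapreduce --jobflow j-YOURJOBFLOWID \
    --stream \
    --mapper s3://mybucket/mapper.py \
    --reducer s3://mybucket/reducer.py \
    --input s3://mybucket/input/ \
    --output s3://mybucket/output/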

In fact, you don't even need to pass --stream; just --create --alive will do it. – Matthew Rathbone Aug 10 '11 at 19:24
You can't do this with the web console, but through the API and programming tools you can add multiple steps to a long-running job flow, which is what I do. That way you can fire off jobs one after the other on the same long-running cluster, without having to create a new one each time.
If you are familiar with Python, I highly recommend the Boto library. The other AWS API tools let you do this as well.
If you follow the Boto EMR tutorial, you'll find some examples.
Just to give you an idea, this is what I do (with streaming jobs):
import sys
import time

import boto
from boto.emr.step import StreamingStep

# Connect to EMR
conn = boto.connect_emr()

# Start a long-running job flow; keep_alive=True is what keeps the
# cluster up after each step finishes
jobid = conn.run_jobflow(name='My jobflow',
                         log_uri='s3://<my log uri>/jobflow_logs',
                         keep_alive=True)

# Create your streaming job
step = StreamingStep(...)

# Add the step to the job flow
conn.add_jobflow_steps(jobid, [step])

# Wait till it's complete
while True:
    state = conn.describe_jobflow(jobid).steps[-1].state
    if state == "COMPLETED":
        break
    if state in ("FAILED", "TERMINATED", "CANCELLED"):
        print >> sys.stderr, "EMR job failed! State = %s" % state
        sys.exit(1)
    time.sleep(60)

# Create your next job here and add it to the EMR cluster
step = StreamingStep(...)
conn.add_jobflow_steps(jobid, [step])

# Repeat :)
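One thing to watch: with keep_alive=True the cluster (and the billing clock) keeps running after the last step, so shut it down explicitly when you're done. boto exposes this as terminate_jobflow:

# Shut the long-running cluster down once everything is finished
conn.terminate_jobflow(jobid)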

To keep the machine alive, start an interactive Pig session; then the machine won't shut down. You can then exercise your map/reduce logic from the command line using:
cat infile.txt | yourMapper | sort | yourReducer > outfile.txt
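For reference, the interactive session itself can be requested when you launch the cluster; a sketch, assuming the elastic-mapreduce Ruby client's --pig-interactive flag:

$ ./elastic-mapreduce --create --alive --name "Pig session" --pig-interactive

You then SSH into the master node and run the pipeline above there.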

You'd have to SSH into the master, and this command chain doesn't run in Hadoop, so there's no parallelism to be gained from it. – Ronen Botzer Feb 25 '11 at 22:30