
I am testing jobs on EMR, and every test takes a long time to start up. Is there a way to keep the server/master node alive in Amazon EMR? I know this can be done with the API, but can it also be done from the AWS console?

Jonik
vkris
  • Check the top answer to the same question here: http://stackoverflow.com/questions/6880283/re-use-amazon-elastic-mapreduce-instance – Matthew Rathbone Aug 31 '11 at 16:28

3 Answers


You cannot do this from the AWS console. To quote the developer guide:

The Amazon Elastic MapReduce tab in the AWS Management Console does not support adding steps to a job flow.

You can only do this via the CLI or API, by creating a job flow and then adding steps to it:

$ ./elastic-mapreduce --create --alive --stream
Ronen Botzer

You can't do this with the web console, but through the API and programming tools you can add multiple steps to a long-running job, which is what I do. That way you can fire off jobs one after the other on the same long-running cluster, without having to create a new one each time.

If you are familiar with Python, I highly recommend the Boto library. The other AWS API tools let you do this as well.

If you follow the Boto EMR tutorial, you'll find some examples.

Just to give you an idea, this is what I do (with streaming jobs):

import sys
import time

import boto
from boto.emr.step import StreamingStep

# Connect to EMR
conn = boto.connect_emr()

# Start a long-running job flow; keep_alive=True keeps the cluster
# up after each step finishes
jobid = conn.run_jobflow(name='My jobflow',
                         log_uri='s3://<my log uri>/jobflow_logs',
                         keep_alive=True)

# Create your streaming job
step = StreamingStep(...)

# Add the step to the job flow
conn.add_jobflow_steps(jobid, [step])

# Wait until the step completes
while True:
    state = conn.describe_jobflow(jobid).steps[-1].state
    if state == 'COMPLETED':
        break
    if state in ('FAILED', 'TERMINATED', 'CANCELLED'):
        sys.stderr.write('EMR job failed! State = %s\n' % state)
        sys.exit(1)
    time.sleep(60)

# Create your next job here and add it to the same cluster
step = StreamingStep(...)
conn.add_jobflow_steps(jobid, [step])

# Repeat :)
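The polling loop above can be factored into a reusable helper. This is only a sketch: `describe_state` and `wait_for_step` are hypothetical names, not part of the boto API — `describe_state` stands in for any zero-argument callable that fetches the latest step state (e.g. a lambda wrapping `conn.describe_jobflow(jobid).steps[-1].state`):

```python
import time

# States that mean the step will never complete
TERMINAL_FAILURES = {"FAILED", "TERMINATED", "CANCELLED"}

def wait_for_step(describe_state, poll_seconds=60, sleep=time.sleep):
    """Poll describe_state() until the step completes.

    describe_state: zero-argument callable returning the current
    step state string. sleep is injectable so tests don't block.
    """
    while True:
        state = describe_state()
        if state == "COMPLETED":
            return state
        if state in TERMINAL_FAILURES:
            raise RuntimeError("EMR step ended in state %s" % state)
        sleep(poll_seconds)
```

With a real connection you would call something like `wait_for_step(lambda: conn.describe_jobflow(jobid).steps[-1].state)` between adding steps.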
Suman

To keep the machine alive, start an interactive Pig session; then the machine won't shut down. You can then test your map/reduce logic from the command line using:

cat infile.txt | yourMapper | sort | yourReducer > outfile.txt
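To illustrate, here is a hypothetical word-count mapper and reducer (not from the original answer) written as plain Python generators in the Hadoop Streaming style: the mapper emits `word<TAB>1` lines, and the reducer sums counts per word, relying on its input arriving sorted — the job `sort` does in the pipe above, played here by `sorted()`:

```python
def mapper(lines):
    # Emit "word\t1" for every word on every input line
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word

def reducer(sorted_lines):
    # Sum counts for each word; assumes input sorted by word
    current, total = None, 0
    for line in sorted_lines:
        word, count = line.split("\t")
        if word != current:
            if current is not None:
                yield "%s\t%d" % (current, total)
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield "%s\t%d" % (current, total)

# Mimics: cat infile.txt | mapper | sort | reducer
counts = list(reducer(sorted(mapper(["a b a", "b a"]))))
# counts == ["a\t3", "b\t2"]
```

The same two scripts, reading stdin and writing stdout, would drop straight into the shell pipeline above.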
JD Long
  • You'd have to SSH into the master, and this command chain doesn't run in Hadoop, so there's no parallelism to be gained. – Ronen Botzer Feb 25 '11 at 22:30