Questions tagged [elastic-map-reduce]

Amazon Elastic MapReduce is a web service that enables the processing of large amounts of data.

Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

http://aws.amazon.com/elasticmapreduce/

452 questions
36 votes, 10 answers

Backup AWS Dynamodb to S3

It has been suggested in the Amazon docs (http://aws.amazon.com/dynamodb/), among other places, that you can back up your DynamoDB tables using Elastic MapReduce. I have a general understanding of how this could work but I couldn't find any guides or…
Ali • 18,665 • 21 • 103 • 138

23 votes, 3 answers

Exporting Hive Table to a S3 bucket

I've created a Hive Table through an Elastic MapReduce interactive session and populated it from a CSV file like this: CREATE TABLE csvimport(id BIGINT, time STRING, log STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'; LOAD DATA LOCAL INPATH…
seedhead • 3,655 • 4 • 32 • 38

23 votes, 2 answers

Why does Yarn on EMR not allocate all nodes to running Spark jobs?

I'm running a job on Apache Spark on Amazon Elastic MapReduce (EMR). Currently I'm running on emr-4.1.0, which includes Amazon Hadoop 2.6.0 and Spark 1.5.0. When I start the job, YARN correctly has allocated all the worker nodes to the Spark job…
retnuH • 1,525 • 2 • 11 • 18

22 votes, 3 answers

Spark + EMR using Amazon's "maximizeResourceAllocation" setting does not use all cores/vcores

I'm running an EMR cluster (version emr-4.2.0) for Spark using the Amazon-specific maximizeResourceAllocation flag as documented here. According to those docs, "this option calculates the maximum compute and memory resources available for an…
retnuH • 1,525 • 2 • 11 • 18

17 votes, 2 answers

Slow Performance with Apache Spark Gradient Boosted Tree training runs

I'm experimenting with the Gradient Boosted Trees learning algorithm from the ML library of Spark 1.4. I'm solving a binary classification problem where my input is ~50,000 samples and ~500,000 features. My goal is to output the definition of the resulting…
17 votes, 7 answers

Deleting file/folder from Hadoop

I'm running an EMR Activity inside a Data Pipeline analyzing log files and I get the following error when my Pipeline fails: Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory…
15 votes, 7 answers

Scheduling A Job on AWS EC2

I have a website running on AWS EC2. I need to create a nightly job that generates a sitemap file and uploads the files to the various browsers. I'm looking for a utility on AWS that allows this functionality. I've considered the following: 1)…
threejeez • 2,314 • 6 • 30 • 51

15 votes, 1 answer

Get a yarn configuration from commandline

In EMR, is there a way to get a specific value of the configuration given the configuration key using the yarn command? For example I would like to do something like this yarn get-config yarn.scheduler.maximum-allocation-mb
fo_x86 • 2,583 • 1 • 30 • 41

14 votes, 5 answers

Drop all partitions from a hive table?

How can I drop all partitions currently loaded in a Hive table? I can drop a single partition with alter table drop partition(a=, b=...); I can load all partitions with the recover partitions statement. But I cannot seem to drop all…
Matt Joiner • 112,946 • 110 • 377 • 526

12 votes, 3 answers

How can I wait for completion of an Elastic MapReduce job flow in a Java application?

Recently I've been working with Amazon Web Services (AWS) and I've noticed there is not much documentation on the subject, so I added my solution. I was writing an application using Amazon Elastic MapReduce (Amazon EMR). After the calculations ended…
11 votes, 3 answers

Re-use Amazon Elastic MapReduce instance

I have tried a simple Map/Reduce task using Amazon Elastic MapReduce and it took just 3 mins to complete the task. Is it possible to re-use the same instance to run another task? Even though I have just used the instance for 3 mins Amazon will…
Maggie • 5,923 • 8 • 41 • 56

11 votes, 1 answer

Loading data with Hive, S3, EMR, and Recover Partitions

SOLVED: See Update #2 below for the 'solution' to this issue. ~~~~~~~ In s3, I have some log*.gz files stored in a nested directory structure like: s3://($BUCKET)/y=2012/m=11/d=09/H=10/ I'm attempting to load these into Hive on Elastic Map Reduce…
Mike Repass • 6,825 • 5 • 38 • 35

11 votes, 1 answer

Elastic Mapreduce Map output lost

I'm running a large (more than 100 nodes) series of mapreduce jobs on Amazon Elastic MapReduce. In the reduce phase, already-completed map tasks keep failing with Map output lost, rescheduling: getMapOutput(attempt_201204182047_0053_m_001053_0,299)…
dspyz • 5,280 • 2 • 25 • 63

10 votes, 5 answers

Are there any distributed machine learning libraries for using Python with Hadoop?

I have set myself up with Amazon Elastic MapReduce in order to perform various standard machine learning tasks. I have used Python extensively for local machine learning in the past and I do not know Java. As far as I can tell there are no well…
iRoygbiv • 865 • 2 • 7 • 21

10 votes, 2 answers

Configuring external data source for Elastic MapReduce

We want to use Amazon Elastic MapReduce on top of our current DB (we are using Cassandra on EC2). Looking at the Amazon EMR FAQ, it should be possible: Amazon EMR FAQ: Q: Can I load my data from the internet or somewhere other than Amazon…
Víctor Penela • 474 • 1 • 6 • 16