Questions tagged [elastic-map-reduce]

Amazon Elastic MapReduce is a web service that enables the processing of large amounts of data.

Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

http://aws.amazon.com/elasticmapreduce/

Synonymous tag: emr

452 questions

10 answers

Backup AWS Dynamodb to S3

It has been suggested in the Amazon docs (http://aws.amazon.com/dynamodb/), among other places, that you can back up your DynamoDB tables using Elastic MapReduce. I have a general understanding of how this could work but I couldn't find any guides or…

amazon-s3 backup amazon-dynamodb elastic-map-reduce

asked Nov 29 '12 at 16:49

Ali

3 answers

Exporting Hive Table to a S3 bucket

I've created a Hive Table through an Elastic MapReduce interactive session and populated it from a CSV file like this: CREATE TABLE csvimport(id BIGINT, time STRING, log STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'; LOAD DATA LOCAL INPATH…

amazon-s3 hive elastic-map-reduce emr

asked Feb 28 '12 at 20:48

seedhead

2 answers

Why does Yarn on EMR not allocate all nodes to running Spark jobs?

I'm running a job on Apache Spark on Amazon Elastic MapReduce (EMR). Currently I'm running on emr-4.1.0, which includes Amazon Hadoop 2.6.0 and Spark 1.5.0. When I start the job, YARN correctly has allocated all the worker nodes to the Spark job…

apache-spark hadoop-yarn emr amazon-emr elastic-map-reduce

asked Nov 26 '15 at 14:16

retnuH

3 answers

Spark + EMR using Amazon's "maximizeResourceAllocation" setting does not use all cores/vcores

I'm running an EMR cluster (version emr-4.2.0) for Spark using the Amazon-specific maximizeResourceAllocation flag as documented here. According to those docs, "this option calculates the maximum compute and memory resources available for an…

apache-spark hadoop-yarn emr amazon-emr elastic-map-reduce

asked Nov 30 '15 at 16:51

retnuH

2 answers

Slow Performance with Apache Spark Gradient Boosted Tree training runs

I'm experimenting with the Gradient Boosted Trees learning algorithm from the ML library of Spark 1.4. I'm solving a binary classification problem where my input is ~50,000 samples and ~500,000 features. My goal is to output the definition of the resulting…

amazon-web-services machine-learning apache-spark elastic-map-reduce

asked Sep 21 '15 at 19:22

Vlad Kutsenko

7 answers

Deleting file/folder from Hadoop

I'm running an EMR Activity inside a Data Pipeline analyzing log files and I get the following error when my Pipeline fails: Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory…

hadoop amazon-web-services amazon-s3 elastic-map-reduce

asked May 28 '13 at 16:47

cevallos.valtira

7 answers

Scheduling A Job on AWS EC2

I have a website running on AWS EC2. I need to create a nightly job that generates a sitemap file and uploads the files to the various browsers. I'm looking for a utility on AWS that allows this functionality. I've considered the following: 1)…

amazon-ec2 amazon-web-services cron jobs elastic-map-reduce

asked Jan 10 '12 at 23:21

threejeez

1 answer

Get a yarn configuration from commandline

In EMR, is there a way to get the value of a specific configuration key using the yarn command? For example, I would like to do something like this: yarn get-config yarn.scheduler.maximum-allocation-mb

hadoop hadoop-yarn hadoop2 emr elastic-map-reduce

asked Jan 07 '16 at 22:31

fo_x86

5 answers

Drop all partitions from a hive table?

How can I drop all partitions currently loaded in a Hive table? I can drop a single partition with alter table drop partition(a=, b=...); I can load all partitions with the recover partitions statement. But I cannot seem to drop all…

hive elastic-map-reduce

asked Mar 19 '13 at 05:52

Matt Joiner

3 answers

How can I wait for completion of an Elastic MapReduce job flow in a Java application?

Recently I've been working with Amazon Web Services (AWS) and I've noticed there is not much documentation on the subject, so I added my solution. I was writing an application using Amazon Elastic MapReduce (Amazon EMR). After the calculations ended…

java amazon-web-services elastic-map-reduce amazon-emr

asked May 25 '12 at 16:47

siditom

3 answers

Re-use Amazon Elastic MapReduce instance

I have tried a simple Map/Reduce task using Amazon Elastic MapReduce and it took just 3 minutes to complete the task. Is it possible to re-use the same instance to run another task? Even though I have just used the instance for 3 minutes Amazon will…

amazon-ec2 mapreduce elastic-map-reduce

asked Jul 30 '11 at 00:27

Maggie

1 answer

Loading data with Hive, S3, EMR, and Recover Partitions

SOLVED: See Update #2 below for the 'solution' to this issue. ~~~~~~~ In s3, I have some log*.gz files stored in a nested directory structure like: s3://($BUCKET)/y=2012/m=11/d=09/H=10/ I'm attempting to load these into Hive on Elastic Map Reduce…

hadoop amazon-s3 amazon-web-services hive elastic-map-reduce

asked Nov 10 '12 at 03:53

Mike Repass

1 answer

Elastic Mapreduce Map output lost

I'm running a large (more than 100 nodes) series of mapreduce jobs on Amazon Elastic MapReduce. In the reduce phase, already-completed map tasks keep failing with Map output lost, rescheduling: getMapOutput(attempt_201204182047_0053_m_001053_0,299)…

hadoop amazon-web-services jetty elastic-map-reduce amazon-emr

asked Apr 19 '12 at 06:39

dspyz

5 answers

Are there any distributed machine learning libraries for using Python with Hadoop?

I have set myself up with Amazon Elastic MapReduce in order to perform various standard machine learning tasks. I have used Python extensively for local machine learning in the past and I do not know Java. As far as I can tell there are no well…

python hadoop mapreduce hadoop-streaming elastic-map-reduce

asked Jan 09 '13 at 11:03

iRoygbiv

2 answers

Configuring external data source for Elastic MapReduce

We want to use Amazon Elastic MapReduce on top of our current DB (we are using Cassandra on EC2). Looking at the Amazon EMR FAQ, it should be possible: Amazon EMR FAQ: Q: Can I load my data from the internet or somewhere other than Amazon…

amazon-web-services cassandra elastic-map-reduce

asked Aug 29 '12 at 12:00

Víctor Penela
