Questions tagged [amazon-emr]

Amazon Elastic MapReduce (Amazon EMR) is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

3368 questions

votes

5 answers

"Container killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical memory used" on an EMR cluster with 75GB of memory

I'm running a 5-node Spark cluster on AWS EMR, each node sized m3.xlarge (1 master, 4 slaves). I successfully ran through a 146 MB bzip2-compressed CSV file and ended up with a perfectly aggregated result. Now I'm trying to process a ~5 GB bzip2 CSV file on…

asked Nov 24 '16 at 08:33

lauri108

1,381
1
13
22
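The error in the question above compares a container's physical memory use with the size YARN granted it. As a rough sketch of the arithmetic, and assuming the commonly documented Spark 2.x default in which the executor memory overhead is the larger of 384 MiB and 10% of the executor memory, the requested container size can be estimated as below. The function name and parameters are hypothetical, and YARN's rounding up to its minimum allocation increment is not modeled.

```python
def container_request_mb(executor_memory_mb, overhead_fraction=0.10, min_overhead_mb=384):
    """Estimate the memory Spark asks YARN for per executor, in MiB.

    Assumption: overhead = max(min_overhead_mb, overhead_fraction * executor memory),
    which mirrors the commonly documented Spark 2.x default. Check the
    documentation for the Spark version actually running on the cluster.
    """
    overhead_mb = max(min_overhead_mb, int(overhead_fraction * executor_memory_mb))
    return executor_memory_mb + overhead_mb


# Small executors are dominated by the 384 MiB floor,
# larger ones by the 10% fraction.
small = container_request_mb(1024)
large = container_request_mb(8192)
print(small, large)
```

If the process grows past this total, for example through off-heap buffers, YARN kills the container even when the cluster as a whole still has free memory.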

votes

4 answers

AWS VPC identify private and public subnet

I have a VPC in my AWS account and there are 5 subnets associated with that VPC. The subnets are of 2 types: public and private. How can I identify which subnet is public and which is private? Each subnet has a CIDR in the 10.249.?.? range. Basically when I launch…

amazon-web-services amazon-emr amazon-vpc subnet

asked Feb 16 '18 at 16:17

user1846749

2,165
3
23
36
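For the subnet question above, the usual distinction is that a public subnet is associated with a route table containing a route to an internet gateway, while a subnet with no explicit association falls back to the VPC's main route table. The sketch below applies that rule to plain dictionaries shaped like the EC2 DescribeRouteTables response; the function name, the subnet IDs, and the sample data are hypothetical, and fetching the real data (for instance with boto3) is left out.

```python
def classify_subnets(subnet_ids, route_tables):
    """Label each subnet 'public' or 'private' from route table data.

    route_tables is a list of dicts with 'Associations' and 'Routes' keys,
    shaped like the entries of a DescribeRouteTables response.
    """
    def has_igw_route(table):
        # Internet gateway targets have IDs that start with "igw-".
        return any(
            str(route.get("GatewayId", "")).startswith("igw-")
            for route in table.get("Routes", [])
        )

    explicit = {}      # subnet ID -> explicitly associated route table
    main_table = None  # fallback for subnets without an explicit association
    for table in route_tables:
        for assoc in table.get("Associations", []):
            if assoc.get("Main"):
                main_table = table
            if assoc.get("SubnetId"):
                explicit[assoc["SubnetId"]] = table

    labels = {}
    for subnet_id in subnet_ids:
        table = explicit.get(subnet_id, main_table)
        is_public = table is not None and has_igw_route(table)
        labels[subnet_id] = "public" if is_public else "private"
    return labels


# Hypothetical sample: one table routes to an internet gateway,
# the main table only has the local route.
sample_tables = [
    {
        "Associations": [{"SubnetId": "subnet-aaa", "Main": False}],
        "Routes": [
            {"DestinationCidrBlock": "10.249.0.0/16", "GatewayId": "local"},
            {"DestinationCidrBlock": "0.0.0.0/0", "GatewayId": "igw-123"},
        ],
    },
    {
        "Associations": [{"Main": True}],
        "Routes": [{"DestinationCidrBlock": "10.249.0.0/16", "GatewayId": "local"}],
    },
]
labels = classify_subnets(["subnet-aaa", "subnet-bbb"], sample_tables)
print(labels)
```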

votes

13 answers

Application report for application_ (state: ACCEPTED) never ends for Spark Submit (with Spark 1.2.0 on YARN)

I am running a Kinesis plus Spark application (https://spark.apache.org/docs/1.2.0/streaming-kinesis-integration.html). I run the command below on an EC2 instance: ./spark/bin/spark-submit --class org.apache.spark.examples.streaming.myclassname…

apache-spark hadoop-yarn amazon-emr amazon-kinesis

asked Jun 14 '15 at 11:35

Sam

1,333
5
23
36

votes

7 answers

How do you make a HIVE table out of JSON data?

I want to create a Hive table out of some (nested) JSON data and run queries on it. Is this even possible? I've gotten as far as uploading the JSON file to S3 and launching an EMR instance, but I don't know what to type in the Hive console to get…

json hadoop hive amazon-emr emr

asked Jul 13 '12 at 22:37

nickponline

25,354
32
99
167

votes

7 answers

Extremely slow S3 write times from EMR/ Spark

I'm writing to see if anyone knows how to speed up S3 write times from Spark running in EMR. My Spark job takes over 4 hours to complete; however, the cluster is only under load during the first 1.5 hours. I was curious about what Spark was doing all…

amazon-web-services apache-spark amazon-s3 amazon-emr

asked Mar 15 '17 at 23:14

jspooner

10,975
11
58
81

votes

4 answers

Amazon EMR - What is the need for Task nodes when we have Core nodes?

I have been learning about Amazon EMR lately, and as far as I know an EMR cluster lets us choose 3 node types: Master, which runs the primary Hadoop daemons like NameNode, JobTracker and ResourceManager; Core, which runs DataNode and TaskTracker…

hadoop hadoop2 amazon-emr

asked Jan 07 '17 at 08:23

Taher Koitawala

votes

2 answers

How do you delete an AWS EMR Cluster?

I've been playing around with AWS EMR and I now have a few terminated clusters that I want to delete. However, there is no obvious option to delete them. How do I make them go away?

amazon-web-services emr amazon-emr

asked Nov 11 '15 at 23:01

vy32

28,461
37
122
246

votes

7 answers

How do I make matplotlib work in AWS EMR Jupyter notebook?

This is very close to the question "Matplotlib Plotting using AWS-EMR jupyter notebook", but I have added a few details specific to my case. I would like to find a way to use matplotlib inside my Jupyter notebook. Here is the code-snippet in…

python matplotlib pyspark jupyter-notebook amazon-emr

asked May 22 '19 at 21:00

Matt

5,404
3
27
39

votes

10 answers

pyspark error does not exist in the jvm error when initializing SparkContext

I am using Spark on EMR and writing a PySpark script. I get an error when trying to run: from pyspark import SparkContext; sc = SparkContext(). This is the error: File "pyex.py", line 5, in sc = SparkContext() File…

python python-3.x apache-spark pyspark amazon-emr

asked Nov 05 '18 at 20:45

thebeancounter

4,261
8
61
109

votes

1 answer

How to configure high performance BLAS/LAPACK for Breeze on Amazon EMR, EC2

I am trying to set up an environment to support exploratory data analytics on a cluster. Based on an initial survey of what's out there, my target is to use Scala/Spark with Amazon EMR to provision the cluster. Currently I'm just trying to get some…

apache-spark amazon-ec2 amazon-emr scala-breeze jblas

asked Jun 16 '16 at 01:01

Tim Ryan

1,010
2
11
19

votes

5 answers

Python pip install pyarrow error, unable to execute 'cmake'

I'm trying to install pyarrow on the master instance of my EMR cluster, but I always receive this error: [hadoop@ip-XXX-XXX-XXX-XXX ~]$ sudo /usr/bin/pip-3.4 install pyarrow Collecting pyarrow Downloading…

python-3.x cmake pip amazon-emr pyarrow

asked Sep 05 '18 at 09:12

Yiming Wu

votes

3 answers

Pyspark - Load file: Path does not exist

I am a newbie to Spark. I'm trying to read a local csv file within an EMR cluster. The file is located in: /home/hadoop/. The script that I'm using is this one: spark = SparkSession \ .builder \ .appName("Protob Conversion to Parquet") \ …

apache-spark pyspark emr amazon-emr apache-spark-sql

asked Feb 07 '17 at 13:51

ultraInstinct

4,063
10
36
53

votes

3 answers

How to handle changing parquet schema in Apache Spark

I have run into a problem where I have Parquet data as daily chunks in S3 (in the form of s3://bucketName/prefix/YYYY/MM/DD/) but I cannot read the data in AWS EMR Spark from different dates because some column types do not match and I get one of…

apache-spark apache-spark-sql parquet amazon-emr

asked Dec 02 '16 at 07:52

V. Samma

2,558
8
30
34
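The daily layout mentioned in the question above (s3://bucketName/prefix/YYYY/MM/DD/) can be enumerated with the standard library so that each day is loaded and its schema inspected separately. This is only a sketch of generating the paths; it does not resolve the mismatched column types, and the function name and date range are hypothetical.

```python
from datetime import date, timedelta


def daily_prefixes(bucket, prefix, start, end):
    """Return one S3 prefix per day from start to end, both inclusive."""
    paths = []
    day = start
    while day <= end:
        # strftime zero-pads month and day, matching the YYYY/MM/DD layout.
        paths.append(f"s3://{bucket}/{prefix}/{day.strftime('%Y/%m/%d')}/")
        day += timedelta(days=1)
    return paths


# Hypothetical range that crosses a month boundary.
paths = daily_prefixes("bucketName", "prefix", date(2016, 11, 30), date(2016, 12, 2))
for path in paths:
    print(path)
```

Each path can then be passed to the reader one at a time, which makes it easier to see which day introduced a differing column type.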

votes

2 answers

Strange spark ERROR on AWS EMR

I have a really simple PySpark script that creates a dataframe from some Parquet data on S3, then calls the count() method and prints out the number of records. I run the script on an AWS EMR cluster and I'm seeing the following strange WARN…

amazon-web-services apache-spark pyspark amazon-emr

asked Dec 04 '17 at 14:26

seiya

1,477
3
17
26

votes

6 answers

Does Hive have something equivalent to DUAL?

I'd like to run statements like SELECT date_add('2008-12-31', 1) FROM DUAL Does Hive (running on Amazon EMR) have something similar?

hadoop hive amazon-emr

asked Mar 20 '12 at 22:00

jbreed

1,514
5
22
35
