Questions tagged [amazon-emr]

Amazon Elastic MapReduce (Amazon EMR) is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).


3368 questions
67
votes
5 answers

"Container killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical memory used" on an EMR cluster with 75GB of memory

I'm running a 5-node Spark cluster on AWS EMR, each node sized m3.xlarge (1 master, 4 slaves). I successfully ran through a 146 MB bzip2-compressed CSV file and ended up with a perfectly aggregated result. Now I'm trying to process a ~5 GB bzip2 CSV file on…
lauri108
  • 1,381
  • 1
  • 13
  • 22
63
votes
4 answers

AWS VPC identify private and public subnet

I have a VPC in my AWS account with 5 subnets associated with it. The subnets are of 2 types: public and private. How do I identify which subnet is public and which is private? Each subnet has a CIDR in the 10.249.?.? range. Basically when I launch…
user1846749
  • 2,165
  • 3
  • 23
  • 36
51
votes
13 answers

Application report for application_ (state: ACCEPTED) never ends for Spark Submit (with Spark 1.2.0 on YARN)

I am running a Kinesis plus Spark application (https://spark.apache.org/docs/1.2.0/streaming-kinesis-integration.html). I am running the command below on an EC2 instance: ./spark/bin/spark-submit --class org.apache.spark.examples.streaming.myclassname…
Sam
  • 1,333
  • 5
  • 23
  • 36
39
votes
7 answers

How do you make a HIVE table out of JSON data?

I want to create a Hive table out of some nested JSON data and run queries on it. Is this even possible? I've gotten as far as uploading the JSON file to S3 and launching an EMR instance, but I don't know what to type in the Hive console to get…
nickponline
  • 25,354
  • 32
  • 99
  • 167
36
votes
7 answers

Extremely slow S3 write times from EMR/Spark

I'm writing to see if anyone knows how to speed up S3 write times from Spark running in EMR. My Spark job takes over 4 hours to complete; however, the cluster is only under load during the first 1.5 hours. I was curious about what Spark was doing all…
jspooner
  • 10,975
  • 11
  • 58
  • 81
30
votes
4 answers

Amazon EMR - What is the need for task nodes when we have core nodes?

I have been learning about Amazon EMR lately, and as far as I know an EMR cluster lets us choose 3 node types: Master, which runs the primary Hadoop daemons like NameNode, JobTracker and ResourceManager; Core, which runs DataNode and TaskTracker…
Taher Koitawala
  • 301
  • 1
  • 3
  • 6
30
votes
2 answers

How do you delete an AWS EMR Cluster?

I've been playing around with AWS EMR and I now have a few terminated clusters that I want to delete. However, there is no obvious option to delete them. How do I make them go away?
vy32
  • 28,461
  • 37
  • 122
  • 246
28
votes
7 answers

How do I make matplotlib work in AWS EMR Jupyter notebook?

This is very close to the question "Matplotlib Plotting using AWS-EMR jupyter notebook", but I have added a few details specific to my case. I would like to find a way to use matplotlib inside my Jupyter notebook. Here is the code snippet in…
Matt
  • 5,404
  • 3
  • 27
  • 39
28
votes
10 answers

pyspark error does not exist in the jvm error when initializing SparkContext

I am using Spark on EMR and writing a PySpark script. I get an error when running: from pyspark import SparkContext; sc = SparkContext(). This is the error: File "pyex.py", line 5, in sc = SparkContext() File…
thebeancounter
  • 4,261
  • 8
  • 61
  • 109
27
votes
1 answer

How to configure high performance BLAS/LAPACK for Breeze on Amazon EMR, EC2

I am trying to set up an environment to support exploratory data analytics on a cluster. Based on an initial survey of what's out there, my target is to use Scala/Spark with Amazon EMR to provision the cluster. Currently I'm just trying to get some…
Tim Ryan
  • 1,010
  • 2
  • 11
  • 19
26
votes
5 answers

Python pip install pyarrow error, unable to execute 'cmake'

I'm trying to install pyarrow on the master instance of my EMR cluster, but I always receive this error: [hadoop@ip-XXX-XXX-XXX-XXX ~]$ sudo /usr/bin/pip-3.4 install pyarrow Collecting pyarrow Downloading…
Yiming Wu
  • 611
  • 1
  • 5
  • 11
26
votes
3 answers

Pyspark - Load file: Path does not exist

I am a newbie to Spark. I'm trying to read a local csv file within an EMR cluster. The file is located in: /home/hadoop/. The script that I'm using is this one: spark = SparkSession \ .builder \ .appName("Protob Conversion to Parquet") \ …
ultraInstinct
  • 4,063
  • 10
  • 36
  • 53
26
votes
3 answers

How to handle changing parquet schema in Apache Spark

I have run into a problem where I have Parquet data as daily chunks in S3 (in the form of s3://bucketName/prefix/YYYY/MM/DD/) but I cannot read the data in AWS EMR Spark from different dates because some column types do not match and I get one of…
V. Samma
  • 2,558
  • 8
  • 30
  • 34
25
votes
2 answers

Strange spark ERROR on AWS EMR

I have a really simple PySpark script that creates a dataframe from some Parquet data on S3, then calls the count() method and prints out the number of records. I run the script on an AWS EMR cluster and I'm seeing the following strange WARN…
seiya
  • 1,477
  • 3
  • 17
  • 26
24
votes
6 answers

Does Hive have something equivalent to DUAL?

I'd like to run statements like SELECT date_add('2008-12-31', 1) FROM DUAL. Does Hive (running on Amazon EMR) have something similar?
jbreed
  • 1,514
  • 5
  • 22
  • 35