Questions tagged [apache-spark-1.4]

Use for questions specific to Apache Spark 1.4. For general questions related to Apache Spark, use the tag [apache-spark].

31 questions
55 votes · 2 answers

How to optimize shuffle spill in Apache Spark application

I am running a Spark Streaming application with 2 workers. The application has a join and a union operation. All the batches complete successfully, but I noticed that the shuffle spill metrics are not consistent with the input data size or output data size…
Vijay Innamuri · 4,242
50 votes · 6 answers

DataFrame join optimization - Broadcast Hash Join

I am trying to join two DataFrames efficiently, one of which is large and the second a bit smaller. Is there a way to avoid all this shuffling? I cannot set autoBroadcastJoinThreshold, because it supports only Integers - and the table I am…
NNamed · 717
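The broadcast hash join the question asks about can be sketched conceptually in plain Python (no Spark required): hash the small table once, ship that hash map to every worker, and stream the large table against it locally, so the large side is never shuffled. The function and data names below are illustrative, not Spark API.

```python
# Conceptual sketch of a broadcast hash join (no Spark required):
# the small relation becomes a hash map that every partition of the
# large relation can probe locally, avoiding a shuffle of the large side.

def broadcast_hash_join(large_rows, small_rows, key):
    # Build phase: hash the small relation on the join key.
    hash_table = {}
    for row in small_rows:
        hash_table.setdefault(row[key], []).append(row)

    # Probe phase: stream the large relation and emit matches.
    joined = []
    for row in large_rows:
        for match in hash_table.get(row[key], []):
            merged = dict(row)
            merged.update({k: v for k, v in match.items() if k != key})
            joined.append(merged)
    return joined

large = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}, {"id": 1, "amount": 5}]
small = [{"id": 1, "country": "DE"}, {"id": 2, "country": "FR"}]
result = broadcast_hash_join(large, small, "id")
print(result)
```

In Spark SQL itself the size threshold below which the planner chooses this strategy is the `spark.sql.autoBroadcastJoinThreshold` setting (in bytes; -1 disables it), which is what the question is trying to work around.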
17 votes · 3 answers

Spark off heap memory leak on Yarn with Kafka direct stream

I am running Spark Streaming 1.4.0 on YARN (Apache Hadoop distribution 2.6.0) with Java 1.8.0_45 and a Kafka direct stream. I am also using Spark with Scala 2.11 support. The issue I am seeing is that both driver and executor containers are gradually…
11 votes · 2 answers

Building Apache Spark using SBT: Invalid or corrupt jarfile

I'm trying to install Spark on my local machine, following this guide. I have installed JDK 7 (and also have JDK 8) and Scala 2.11.7. The problem occurs when I try to use sbt to build Spark 1.4.1: I get the following exception. NOTE: The…
Black · 4,483
9 votes · 2 answers

How to handle null entries in SparkR

I have a Spark SQL DataFrame. Some entries in this data are empty, but they don't behave like NULL or NA. How can I remove them? Any ideas? In R I can easily remove them, but in SparkR it says there is a problem with the S4 system/methods.…
Ole Petersen · 670
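The distinction the question runs into is that an empty string is a real value, not a missing one, so NULL-oriented tools ignore it and the filter has to test for both cases explicitly. A plain-Python sketch of the idea (the data here is invented for illustration):

```python
# Empty strings ("") are distinct from missing values (None): a NULL/NA
# check will not remove them, so the filter must test for both explicitly.

rows = [{"name": "a", "city": "Berlin"},
        {"name": "b", "city": ""},      # empty, but not NULL
        {"name": "c", "city": None}]    # genuinely missing

cleaned = [r for r in rows if r["city"] not in ("", None)]
print(cleaned)
```

In SparkR the analogous step would presumably be a `filter` on the DataFrame with an explicit comparison against the empty string (e.g. `df$city != ""`) in addition to any NULL handling, though the exact incantation depends on the SparkR version.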
8 votes · 3 answers

Cannot start spark-shell

I am using Spark 1.4.1. I can use spark-submit without problems, but when I run ~/spark/bin/spark-shell I get the error below. I have configured SPARK_HOME and JAVA_HOME. However, it was OK with Spark 1.2. 15/10/08 02:40:30 WARN NativeCodeLoader:…
worldterminator · 2,968
7 votes · 0 answers

Custom Transformer in PySpark Pipeline with Cross Validation

I wrote a custom transformer as described here. When I create a pipeline with my transformer as the first step, I am able to train a (logistic regression) model for classification. However, when I want to perform cross validation with this…
vkoe · 381
7 votes · 1 answer

In Apache Spark SQL, How to close metastore connection from HiveContext

My project has unit tests for different HiveContext configurations (sometimes they are in one file, grouped by feature). After upgrading to Spark 1.4 I encounter a lot of 'java.sql.SQLException: Another instance of Derby may have already…
tribbloid · 4,026
5 votes · 1 answer

Spark + Kafka integration - mapping of Kafka partitions to RDD partitions

I have a couple of basic questions related to Spark Streaming [Please let me know if these questions have been answered in other posts - I couldn't find any]: (i) In Spark Streaming, is the number of partitions in an RDD by default equal to the…
jithinpt · 1,204
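For the Kafka direct stream specifically, the documented behaviour is a one-to-one mapping: each Kafka partition becomes exactly one RDD partition, described by an offset range. The bookkeeping can be sketched in plain Python (function and field names here are illustrative, not Spark's API):

```python
# One-to-one mapping of Kafka partitions to RDD partitions, as in the
# direct stream: each RDD partition is just an offset range over a
# single Kafka partition for the current batch.

def plan_batch(current_offsets, latest_offsets):
    # current_offsets / latest_offsets: {kafka_partition: offset}
    return [
        {"kafka_partition": p,
         "from_offset": current_offsets[p],
         "until_offset": latest_offsets[p]}
        for p in sorted(current_offsets)
    ]

ranges = plan_batch({0: 100, 1: 250}, {0: 180, 1: 260})
print(len(ranges), "RDD partitions")  # one per Kafka partition
```

Note this applies to the direct (receiver-less) approach; the older receiver-based stream does not preserve this mapping.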
5 votes · 3 answers

Find size of data stored in rdd from a text file in apache spark

I am new to Apache Spark (version 1.4.1). I wrote a small piece of code to read a text file and store its data in an RDD. Is there a way to get the size of the data in the RDD? This is my code: import org.apache.spark.SparkContext import…
bob · 4,595
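On the JVM side Spark ships a utility for this (`org.apache.spark.util.SizeEstimator.estimate`). As a rough, Spark-free illustration of the same idea, one can sum the serialized size of each record; treat this as an order-of-magnitude sketch, since JVM object overheads differ from pickle's:

```python
import pickle

def estimate_size_bytes(records):
    # Rough estimate: sum of each record's pickled size. Real in-memory
    # object overheads differ, so this is only an approximation.
    return sum(len(pickle.dumps(r)) for r in records)

lines = ["first line", "second line", "third"]
size = estimate_size_bytes(lines)
print(size, "bytes (approx.)")
```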
4 votes · 1 answer

Inconsistent JSON schema guess with Spark dataframes

I am trying to read a JSON file with Spark 1.4.1 DataFrames and to navigate inside it. The guessed schema seems to be incorrect. The JSON file is: { "FILE": { "TUPLE_CLI": [{ "ID_CLI": "C3-00000004", "TUPLE_ABO": [{ …
Victor · 243
4 votes · 5 answers

How to start a Spark Shell using pyspark in Windows?

I am a beginner in Spark and am trying to follow the instructions here on how to initialize the Spark shell from Python using cmd: http://spark.apache.org/docs/latest/quick-start.html But when I run in cmd the…
Alex · 41
3 votes · 1 answer

Why can't YARN acquire any executor when dynamic allocation is enabled?

Jobs run smoothly on YARN without the dynamic allocation feature enabled. I am using Spark 1.4.0. This is what I am trying to do: rdd = sc.parallelize(range(1000000)) rdd.first() This is what I get in the logs: 15/09/08 11:36:12 INFO…
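A common cause of executors never being acquired under dynamic allocation is a missing external shuffle service, which is a documented prerequisite for dynamic allocation on YARN. A minimal `spark-defaults.conf` sketch (the executor bounds are illustrative values):

```
spark.dynamicAllocation.enabled          true
spark.shuffle.service.enabled            true
spark.dynamicAllocation.minExecutors     1
spark.dynamicAllocation.maxExecutors     10
```

The YARN NodeManagers must additionally run Spark's shuffle service as an auxiliary service (`yarn.nodemanager.aux-services` including `spark_shuffle` in `yarn-site.xml`); without it, requested executors are released or never granted.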
3 votes · 1 answer

How to load history data when starting Spark Streaming process, and calculate running aggregations

I have some sales-related JSON data in my ElasticSearch cluster, and I would like to use Spark Streaming (with Spark 1.4.1) to dynamically aggregate incoming sales events from my eCommerce website via Kafka, to have a current view of the user's…
2 votes · 1 answer

Unable to save an RDD[String] as a text file using saveAsTextFile

When I try to write my RDD to a text file on HDFS as shown below, I get an error. val rdd = sc.textFile("/user/hadoop/dxld801/test.txt") val filtered = rdd.map({line=>…