Questions tagged [apache-spark-1.4]

Use for questions specific to Apache Spark 1.4. For general questions related to Apache Spark, use the tag [apache-spark].

31 questions
55 votes · 2 answers

How to optimize shuffle spill in Apache Spark application

I am running a Spark Streaming application with 2 workers. The application has a join and a union operation. All the batches complete successfully, but I noticed that the shuffle spill metrics are not consistent with the input data size or output data size…
Vijay Innamuri · 4,242
50 votes · 6 answers

DataFrame join optimization - Broadcast Hash Join

I am trying to join two DataFrames efficiently, one of which is large and the second a bit smaller. Is there a way to avoid all this shuffling? I cannot set autoBroadcastJoinThreshold, because it supports only Integers - and the table I am…
NNamed · 717
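The broadcast hash join the question asks about can be sketched conceptually in plain Python (no Spark required): hash the small table once, ship that hash map to every worker, and stream the large table against it locally, so the large side is never shuffled. The function and data names below are illustrative, not Spark API.

```python
# Conceptual sketch of a broadcast hash join (no Spark required):
# the small relation becomes a hash map that every partition of the
# large relation can probe locally, avoiding a shuffle of the large side.

def broadcast_hash_join(large_rows, small_rows, key):
    # Build phase: hash the small relation on the join key.
    hash_table = {}
    for row in small_rows:
        hash_table.setdefault(row[key], []).append(row)

    # Probe phase: stream the large relation and emit matches.
    joined = []
    for row in large_rows:
        for match in hash_table.get(row[key], []):
            merged = dict(row)
            merged.update({k: v for k, v in match.items() if k != key})
            joined.append(merged)
    return joined

large = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}, {"id": 1, "amount": 5}]
small = [{"id": 1, "country": "DE"}, {"id": 2, "country": "FR"}]
result = broadcast_hash_join(large, small, "id")
print(result)
```

In Spark SQL itself the size threshold below which the planner chooses this strategy is the `spark.sql.autoBroadcastJoinThreshold` setting (in bytes; -1 disables it), which is what the question is trying to work around.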
17 votes · 3 answers

Spark off heap memory leak on Yarn with Kafka direct stream

I am running Spark Streaming 1.4.0 on YARN (Apache Hadoop distribution 2.6.0) with Java 1.8.0_45 and a Kafka direct stream. I am also using Spark with Scala 2.11 support. The issue I am seeing is that both driver and executor containers are gradually…
11 votes · 2 answers

Building Apache Spark using SBT: Invalid or corrupt jarfile

I'm trying to install Spark on my local machine, following this guide. I have installed JDK 7 (and also have JDK 8) and Scala 2.11.7. The problem occurs when I try to use sbt to build Spark 1.4.1: I get the following exception. NOTE: The…
Black · 4,483
9 votes · 2 answers

How to handle null entries in SparkR

I have a Spark SQL DataFrame. Some entries in this data are empty, but they don't behave like NULL or NA. How can I remove them? Any ideas? In R I can easily remove them, but in SparkR it says there is a problem with the S4 system/methods.…
Ole Petersen · 670
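The distinction the question runs into is that an empty string is a real value, not a missing one, so NULL-oriented tools ignore it and the filter has to test for both cases explicitly. A plain-Python sketch of the idea (the data here is invented for illustration):

```python
# Empty strings ("") are distinct from missing values (None): a NULL/NA
# check will not remove them, so the filter must test for both explicitly.

rows = [{"name": "a", "city": "Berlin"},
        {"name": "b", "city": ""},      # empty, but not NULL
        {"name": "c", "city": None}]    # genuinely missing

cleaned = [r for r in rows if r["city"] not in ("", None)]
print(cleaned)
```

In SparkR the analogous step would presumably be a `filter` on the DataFrame with an explicit comparison against the empty string (e.g. `df$city != ""`) in addition to any NULL handling, though the exact incantation depends on the SparkR version.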
8 votes · 3 answers

Cannot start spark-shell

I am using Spark 1.4.1. I can use spark-submit without problems, but when I run ~/spark/bin/spark-shell I get the error below. I have configured SPARK_HOME and JAVA_HOME. However, it was OK with Spark 1.2. 15/10/08 02:40:30 WARN NativeCodeLoader:…
worldterminator · 2,968
7 votes · 0 answers

Custom Transformer in PySpark Pipeline with Cross Validation

I wrote a custom transformer as described here. When I create a pipeline with my transformer as the first step, I am able to train a (logistic regression) model for classification. However, when I want to perform cross validation with this…
vkoe · 381
7 votes · 1 answer

In Apache Spark SQL, How to close metastore connection from HiveContext

My project has unit tests for different HiveContext configurations (sometimes they are in one file, grouped by feature). After upgrading to Spark 1.4 I encounter a lot of 'java.sql.SQLException: Another instance of Derby may have already…
tribbloid · 4,026
5 votes · 1 answer

Spark + Kafka integration - mapping of Kafka partitions to RDD partitions

I have a couple of basic questions related to Spark Streaming [Please let me know if these questions have been answered in other posts - I couldn't find any]: (i) In Spark Streaming, is the number of partitions in an RDD by default equal to the…
jithinpt · 1,204
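For the Kafka direct stream specifically, the documented behaviour is a one-to-one mapping: each Kafka partition becomes exactly one RDD partition, described by an offset range. The bookkeeping can be sketched in plain Python (function and field names here are illustrative, not Spark's API):

```python
# One-to-one mapping of Kafka partitions to RDD partitions, as in the
# direct stream: each RDD partition is just an offset range over a
# single Kafka partition for the current batch.

def plan_batch(current_offsets, latest_offsets):
    # current_offsets / latest_offsets: {kafka_partition: offset}
    return [
        {"kafka_partition": p,
         "from_offset": current_offsets[p],
         "until_offset": latest_offsets[p]}
        for p in sorted(current_offsets)
    ]

ranges = plan_batch({0: 100, 1: 250}, {0: 180, 1: 260})
print(len(ranges), "RDD partitions")  # one per Kafka partition
```

Note this applies to the direct (receiver-less) approach; the older receiver-based stream does not preserve this mapping.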
5 votes · 3 answers

Find size of data stored in rdd from a text file in apache spark

I am new to Apache Spark (version 1.4.1). I wrote a small piece of code to read a text file and store its data in an RDD. Is there a way to get the size of the data in the RDD? This is my code: import org.apache.spark.SparkContext import…
bob · 4,595
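On the JVM side Spark ships a utility for this (`org.apache.spark.util.SizeEstimator.estimate`). As a rough, Spark-free illustration of the same idea, one can sum the serialized size of each record; treat this as an order-of-magnitude sketch, since JVM object overheads differ from pickle's:

```python
import pickle

def estimate_size_bytes(records):
    # Rough estimate: sum of each record's pickled size. Real in-memory
    # object overheads differ, so this is only an approximation.
    return sum(len(pickle.dumps(r)) for r in records)

lines = ["first line", "second line", "third"]
size = estimate_size_bytes(lines)
print(size, "bytes (approx.)")
```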
4 votes · 1 answer

Inconsistent JSON schema guess with Spark dataframes

I am trying to read a JSON file with Spark 1.4.1 DataFrames and to navigate inside it. The guessed schema seems to be incorrect. The JSON file is: { "FILE": { "TUPLE_CLI": [{ "ID_CLI": "C3-00000004", "TUPLE_ABO": [{ …
Victor · 243
4 votes · 5 answers

How to start a Spark Shell using pyspark in Windows?

I am a beginner in Spark and am trying to follow the instructions here on how to initialize the Spark shell from Python using cmd: http://spark.apache.org/docs/latest/quick-start.html But when I run in cmd the…
Alex · 41
3 votes · 1 answer

Why can't YARN acquire any executor when dynamic allocation is enabled?

Jobs run smoothly on YARN without the dynamic allocation feature enabled. I am using Spark 1.4.0. This is what I am trying to do: rdd = sc.parallelize(range(1000000)) rdd.first() This is what I get in the logs: 15/09/08 11:36:12 INFO…
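A common cause of executors never being acquired under dynamic allocation is a missing external shuffle service, which is a documented prerequisite for dynamic allocation on YARN. A minimal `spark-defaults.conf` sketch (the executor bounds are illustrative values):

```
spark.dynamicAllocation.enabled          true
spark.shuffle.service.enabled            true
spark.dynamicAllocation.minExecutors     1
spark.dynamicAllocation.maxExecutors     10
```

The YARN NodeManagers must additionally run Spark's shuffle service as an auxiliary service (`yarn.nodemanager.aux-services` including `spark_shuffle` in `yarn-site.xml`); without it, requested executors are released or never granted.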
3 votes · 1 answer

How to load history data when starting Spark Streaming process, and calculate running aggregations

I have some sales-related JSON data in my ElasticSearch cluster, and I would like to use Spark Streaming (with Spark 1.4.1) to dynamically aggregate incoming sales events from my eCommerce website via Kafka, to have a current view of the user's…
2 votes · 1 answer

Unable to save an RDD[String] as a text file using saveAsTextFile

When I try to write my RDD to a text file on HDFS as shown below, I get an error. val rdd = sc.textFile("/user/hadoop/dxld801/test.txt") val filtered = rdd.map({line=>…