Questions tagged [spark-streaming]

Spark Streaming is an extension of the core Apache Spark API that enables high-throughput, fault-tolerant stream processing of live data streams. Since version 1.3.0, it supports exactly-once processing semantics, even in the face of failures.

5565 questions
67 votes • 3 answers

Difference in Used, Committed and Max Heap Memory

I am monitoring a Spark executor JVM for an OutOfMemoryException. I used JConsole to connect to the executor JVM. The following is a snapshot of JConsole: in the image, used memory is shown as 3.8G, committed memory is 8.6G, and max memory is also…
Alok • 1,374 • 3 • 18 • 44
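
For readers hitting the same confusion: the three numbers JConsole reports always satisfy used ≤ committed ≤ max. Used is the heap occupied by live objects, committed is what the JVM has actually reserved from the operating system, and max is the -Xmx ceiling, so a gap between used and max does not by itself rule out an OutOfMemoryError in a particular memory pool. A minimal, Spark-agnostic sketch that reads the same three values programmatically via the standard java.lang.management API:

```scala
import java.lang.management.ManagementFactory

object HeapStats {
  def main(args: Array[String]): Unit = {
    // The same three numbers JConsole shows for the heap: used <= committed <= max
    val heap = ManagementFactory.getMemoryMXBean.getHeapMemoryUsage
    val mb = 1024L * 1024L
    println(s"used:      ${heap.getUsed / mb} MB")      // live objects
    println(s"committed: ${heap.getCommitted / mb} MB") // reserved from the OS
    println(s"max:       ${heap.getMax / mb} MB")       // -Xmx ceiling
  }
}
```
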
55 votes • 2 answers

How to optimize shuffle spill in Apache Spark application

I am running a Spark Streaming application with 2 workers. The application has join and union operations. All the batches complete successfully, but I noticed that the shuffle spill metrics are not consistent with the input data size or output data size…
Vijay Innamuri • 4,242 • 7 • 42 • 67
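
Shuffle spill generally means a task's shuffle data does not fit in the execution memory it was given, so it overflows to disk. The usual levers are more shuffle partitions (smaller blocks per task) and compression of whatever does spill. A hedged configuration sketch; the property values below are illustrative, not recommendations:

```scala
import org.apache.spark.SparkConf

// Illustrative values only; tune against your own batch sizes.
val conf = new SparkConf()
  .setAppName("shuffle-spill-tuning-sketch")
  // More partitions per shuffle -> smaller per-task shuffle blocks, less spilling
  .set("spark.default.parallelism", "200")
  // Compress shuffle output and anything that does spill, to cut disk I/O
  .set("spark.shuffle.compress", "true")
  .set("spark.shuffle.spill.compress", "true")
```
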
53 votes • 4 answers

Drop spark dataframe from cache

I am using Spark 1.3.0 with the Python API. While transforming huge dataframes, I cache many DFs for faster execution: df1.cache() df2.cache() Once a certain dataframe is no longer needed, how can I drop it from memory (or un-cache…
ankit patel • 1,399 • 5 • 17 • 29
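
Both the Scala and Python DataFrame APIs expose unpersist() as the inverse of cache(); in PySpark the call is simply df.unpersist(). A minimal Scala sketch (Spark 2.x style; the names and sizes are made up, and the 1.x DataFrame API has the same method):

```scala
import org.apache.spark.sql.SparkSession

object UncacheSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("uncache-sketch").getOrCreate()
    val df1 = spark.range(1000000).toDF("id")

    df1.cache()   // materialized in memory on the first action
    df1.count()

    // Once the DataFrame is no longer needed, drop its blocks from memory.
    // blocking = true waits until the blocks are actually removed.
    df1.unpersist(blocking = true)

    spark.stop()
  }
}
```
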
46 votes • 3 answers

Spark using python: How to resolve Stage x contains a task of very large size (xxx KB). The maximum recommended task size is 100 KB

I've just created a Python list from range(1, 100000). Using SparkContext, I performed the following steps: a = sc.parallelize([i for i in range(1, 100000)]) b = sc.parallelize([i for i in range(1, 100000)]) c = a.zip(b) >>> [(1, 1), (2, 2), -----] sum =…
user2959723 • 651 • 2 • 7 • 13
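
The warning typically appears because sc.parallelize() ships the driver-side collection inside the serialized tasks; generating the range on the executors avoids sending the data at all. A sketch of the same zip-and-sum pipeline in Scala, under that assumption:

```scala
import org.apache.spark.sql.SparkSession

object TaskSizeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("task-size-sketch").getOrCreate()
    val sc = spark.sparkContext

    // sc.range generates the numbers on the executors instead of serializing
    // a local list into each task, which is what triggers the size warning.
    val a = sc.range(1, 100000)
    val b = sc.range(1, 100000)

    val total = a.zip(b).map { case (x, y) => x + y }.sum()
    println(total)

    spark.stop()
  }
}
```
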
42 votes • 2 answers

build.sbt: how to add spark dependencies

Hello, I am trying to pull in spark-core, spark-streaming, twitter4j, and spark-streaming-twitter through the build.sbt file below: name := "hello" version := "1.0" scalaVersion := "2.11.8" libraryDependencies += "org.apache.spark" %% "spark-core" %…
Bobby • 537 • 1 • 5 • 14
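
A hedged build.sbt sketch for that combination. The version numbers are illustrative and must agree with each other; the %% operator appends the Scala binary version, so the Spark artifacts must be published for the declared scalaVersion, and the Twitter receiver lives in Apache Bahir for Spark 2.x:

```scala
// build.sbt -- versions are illustrative; keep Spark, Scala and Bahir versions in sync
name := "hello"
version := "1.0"
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"              % "2.0.0" % "provided",
  "org.apache.spark" %% "spark-streaming"         % "2.0.0" % "provided",
  // the Twitter receiver moved out of Spark into Apache Bahir after 2.0
  "org.apache.bahir" %% "spark-streaming-twitter" % "2.0.0",
  "org.twitter4j"    %  "twitter4j-core"          % "4.0.4"
)
```
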
39 votes • 6 answers

How can I update a broadcast variable in spark streaming?

I have, I believe, a relatively common use case for spark streaming: I have a stream of objects that I would like to filter based on some reference data. Initially, I thought that this would be a very simple thing to achieve using a Broadcast…
Andrew Stubbs • 4,322 • 3 • 29 • 48
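
Broadcast values themselves are immutable, so the pattern usually suggested is to unpersist the old broadcast and publish a fresh one from the driver whenever the reference data changes (for example at the top of each batch in foreachRDD). A sketch of that idea; RefreshableBroadcast is a hypothetical helper, not a Spark API:

```scala
import scala.reflect.ClassTag

import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast

// Hypothetical wrapper: re-broadcasts reference data on demand.
class RefreshableBroadcast[T: ClassTag](sc: SparkContext, load: () => T) {
  @volatile private var bc: Broadcast[T] = sc.broadcast(load())

  // Read this inside transformations as wrapper.get.value
  def get: Broadcast[T] = bc

  // Call from the driver (e.g. at the start of a batch) to pick up new data.
  def refresh(): Unit = {
    bc.unpersist(blocking = false) // drop the stale copy from the executors
    bc = sc.broadcast(load())      // ship a fresh one
  }
}
```

Driver-side code can call refresh() on whatever schedule fits (every batch, every N minutes), while the filter itself keeps reading get.value.
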
37 votes • 4 answers

How to know what is the reason for ClosedChannelExceptions with spark-shell in YARN client mode?

I have been trying to run spark-shell in YARN client mode, but I am getting a lot of ClosedChannelException errors. I am using the Spark 2.0.0 build for Hadoop 2.6. Here are the exceptions: $ spark-2.0.0-bin-hadoop2.6/bin/spark-shell --master yarn…
aks • 1,019 • 1 • 9 • 17
37 votes • 8 answers

How to write spark streaming DF to Kafka topic

I am using Spark Streaming to process data between two Kafka queues, but I cannot seem to find a good way to write to Kafka from Spark. I have tried this: input.foreachRDD(rdd => rdd.foreachPartition(partition => partition.foreach { case…
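
Before the structured Kafka sink existed, the usual answer took exactly this shape: create the producer inside foreachPartition (producers are not serializable, so they cannot be built on the driver) and send each record from there. A sketch under that assumption; the broker address and topic name are placeholders:

```scala
import java.util.Properties

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.streaming.dstream.DStream

object KafkaSinkSketch {
  def writeToKafka(output: DStream[String]): Unit = {
    output.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        // One producer per partition, built on the executor.
        val props = new Properties()
        props.put("bootstrap.servers", "localhost:9092") // placeholder broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        val producer = new KafkaProducer[String, String](props)

        records.foreach { msg =>
          producer.send(new ProducerRecord[String, String]("output-topic", msg)) // placeholder topic
        }
        producer.close()
      }
    }
  }
}
```
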
36 votes • 6 answers

Spark DataFrame: does groupBy after orderBy maintain that order?

I have a Spark 2.0 dataframe example with the following structure: id, hour, count id1, 0, 12 id1, 1, 55 .. id1, 23, 44 id2, 0, 12 id2, 1, 89 .. id2, 23, 34 etc. It contains 24 entries for each id (one for each hour of the day) and is ordered by…
Ana Todor • 781 • 1 • 6 • 15
35 votes • 1 answer

The value of "spark.yarn.executor.memoryOverhead" setting?

Should the value of spark.yarn.executor.memoryOverhead in a Spark job on YARN be allocated to the app, or just set to the max value?
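
spark.yarn.executor.memoryOverhead is a per-executor amount of off-heap memory (Netty buffers, thread stacks, interned strings, and so on) that YARN adds on top of spark.executor.memory when sizing the container; the container request is roughly executor memory plus the overhead. In recent 1.x/2.x versions the default is about 10% of executor memory with a 384 MB floor. An illustrative sketch, with made-up numbers:

```scala
import org.apache.spark.SparkConf

// Illustrative numbers only. Each YARN container for an executor is requested at
// roughly spark.executor.memory + spark.yarn.executor.memoryOverhead.
val conf = new SparkConf()
  .set("spark.executor.memory", "8g")
  // Raise the overhead explicitly when containers are killed for exceeding memory limits.
  .set("spark.yarn.executor.memoryOverhead", "1024") // in MB
```
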
33 votes • 5 answers

Queries with streaming sources must be executed with writeStream.start();

I'm trying to read messages from Kafka (version 10) in Spark and print them. import spark.implicits._ val spark = SparkSession .builder .appName("StructuredNetworkWordCount") …
shivali • 437 • 1 • 6 • 13
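
The error means a streaming Dataset was used with a batch-style action such as show(), count(), or collect(); a streaming query only runs once it is started with writeStream…start() and kept alive. A minimal sketch, assuming the spark-sql-kafka-0-10 package is on the classpath and using placeholder broker and topic names:

```scala
import org.apache.spark.sql.SparkSession

object StructuredKafkaSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("StructuredNetworkWordCount").getOrCreate()
    import spark.implicits._

    val lines = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // placeholder
      .option("subscribe", "input-topic")                  // placeholder
      .load()
      .selectExpr("CAST(value AS STRING)")
      .as[String]

    // show()/collect() on a streaming Dataset throws the error in the title;
    // instead, start a streaming query and let it run.
    val query = lines.writeStream
      .outputMode("append")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```
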
33 votes • 1 answer

Use Spring together with Spark

I'm developing a Spark application and I'm used to Spring as a dependency injection framework. Now I'm stuck with the problem that the processing part uses the @Autowired functionality of Spring, but it is serialized and deserialized by Spark. So…
itsme • 852 • 1 • 10 • 23
32 votes • 4 answers

How do I stop a spark streaming job?

I have a Spark Streaming job which has been running continuously. How do I stop the job gracefully? I have read the usual recommendations of attaching a shutdown hook in the job monitoring and sending a SIGTERM to the job. sys.ShutdownHookThread { …
Saket • 3,079 • 3 • 29 • 48
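
Two mechanisms usually come up in answers: set spark.streaming.stopGracefullyOnShutdown so that Spark's own shutdown hook drains in-flight batches on SIGTERM, or call StreamingContext.stop with stopGracefully = true from your own monitoring code. A sketch of both, with a made-up batch interval:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object GracefulStopSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("graceful-stop-sketch")
      // On SIGTERM, finish the batches already received before shutting down.
      .set("spark.streaming.stopGracefullyOnShutdown", "true")

    val ssc = new StreamingContext(conf, Seconds(10))
    // ... define DStreams, then call ssc.start() and ssc.awaitTermination() ...

    // Explicit programmatic stop (e.g. triggered by an external signal or marker file):
    ssc.stop(stopSparkContext = true, stopGracefully = true)
  }
}
```
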
29 votes • 1 answer

What's the meaning of DStream.foreachRDD function?

In Spark Streaming, every batch interval of data always generates one and only one RDD, so why do we use foreachRDD() to iterate over RDDs? There is only one RDD, so a foreach shouldn't be needed. In my testing, I have never seen more than one RDD.
Guo • 1,761 • 2 • 22 • 45
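
Within one batch there is indeed only one RDD; foreachRDD is not a loop over many RDDs at once, it registers an output action that the scheduler runs once per batch interval, each time on that interval's single RDD. A small sketch illustrating this (the socket source and port are placeholders):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ForeachRDDSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("foreachRDD-sketch")
    val ssc = new StreamingContext(conf, Seconds(5))

    val lines = ssc.socketTextStream("localhost", 9999) // placeholder source

    // Runs once per 5-second batch, on that batch's single RDD.
    lines.foreachRDD { (rdd, time) =>
      println(s"batch at $time contains ${rdd.count()} records")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```
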
29 votes • 3 answers

How to specify multiple dependencies using --packages for spark-submit?

I have the following as the command line to start a spark streaming job. spark-submit --class com.biz.test \ --packages \ org.apache.spark:spark-streaming-kafka_2.10:1.3.0 \ …
davidpricedev • 2,107 • 2 • 20 • 34
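
The usual answer is that --packages takes a single comma-separated list of Maven coordinates, with no spaces and no line continuation between --packages and its value. An illustrative invocation; the second coordinate is a placeholder, not taken from the question:

```
# --packages takes one comma-separated list; the second coordinate below is a placeholder
spark-submit --class com.biz.test \
  --packages org.apache.spark:spark-streaming-kafka_2.10:1.3.0,some.group:some-artifact:1.0.0 \
  ...
```
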