Questions tagged [spark-streaming]

Spark Streaming is an extension of the core Apache Spark API that enables high-throughput, fault-tolerant stream processing of live data streams. Since version 1.3.0, it supports exactly-once processing semantics, even in the face of failures.

5565 questions
67 votes • 3 answers

Difference in Used, Committed and Max Heap Memory

I am monitoring a Spark executor JVM for an OutOfMemoryException. I used JConsole to connect to the executor JVM. The following is a snapshot of JConsole: in the image, used memory is shown as 3.8G, committed memory is 8.6G, and max memory is also…
Alok • 1,374 • 3 • 18 • 44
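
For readers hitting the same confusion: the three numbers JConsole reports always satisfy used ≤ committed ≤ max. Used is the heap occupied by live objects, committed is what the JVM has actually reserved from the operating system, and max is the -Xmx ceiling, so a gap between used and max does not by itself rule out an OutOfMemoryError in a particular memory pool. A minimal, Spark-agnostic sketch that reads the same three values programmatically via the standard java.lang.management API:

```scala
import java.lang.management.ManagementFactory

object HeapStats {
  def main(args: Array[String]): Unit = {
    // The same three numbers JConsole shows for the heap: used <= committed <= max
    val heap = ManagementFactory.getMemoryMXBean.getHeapMemoryUsage
    val mb = 1024L * 1024L
    println(s"used:      ${heap.getUsed / mb} MB")      // live objects
    println(s"committed: ${heap.getCommitted / mb} MB") // reserved from the OS
    println(s"max:       ${heap.getMax / mb} MB")       // -Xmx ceiling
  }
}
```
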
55 votes • 2 answers

How to optimize shuffle spill in Apache Spark application

I am running a Spark Streaming application with 2 workers. The application has join and union operations. All the batches complete successfully, but I noticed that the shuffle spill metrics are not consistent with the input data size or output data size…
Vijay Innamuri • 4,242 • 7 • 42 • 67
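
Shuffle spill generally means a task's shuffle data does not fit in the execution memory it was given, so it overflows to disk. The usual levers are more shuffle partitions (smaller blocks per task) and compression of whatever does spill. A hedged configuration sketch; the property values below are illustrative, not recommendations:

```scala
import org.apache.spark.SparkConf

// Illustrative values only; tune against your own batch sizes.
val conf = new SparkConf()
  .setAppName("shuffle-spill-tuning-sketch")
  // More partitions per shuffle -> smaller per-task shuffle blocks, less spilling
  .set("spark.default.parallelism", "200")
  // Compress shuffle output and anything that does spill, to cut disk I/O
  .set("spark.shuffle.compress", "true")
  .set("spark.shuffle.spill.compress", "true")
```
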
53 votes • 4 answers

Drop spark dataframe from cache

I am using Spark 1.3.0 with the Python API. While transforming huge dataframes, I cache many DFs for faster execution: df1.cache() df2.cache() Once a certain dataframe is no longer needed, how can I drop it from memory (or un-cache…
ankit patel • 1,399 • 5 • 17 • 29
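
Both the Scala and Python DataFrame APIs expose unpersist() as the inverse of cache(); in PySpark the call is simply df.unpersist(). A minimal Scala sketch (Spark 2.x style; the names and sizes are made up, and the 1.x DataFrame API has the same method):

```scala
import org.apache.spark.sql.SparkSession

object UncacheSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("uncache-sketch").getOrCreate()
    val df1 = spark.range(1000000).toDF("id")

    df1.cache()   // materialized in memory on the first action
    df1.count()

    // Once the DataFrame is no longer needed, drop its blocks from memory.
    // blocking = true waits until the blocks are actually removed.
    df1.unpersist(blocking = true)

    spark.stop()
  }
}
```
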
46 votes • 3 answers

Spark using python: How to resolve Stage x contains a task of very large size (xxx KB). The maximum recommended task size is 100 KB

I've just created a Python list from range(1, 100000). Using SparkContext, I performed the following steps: a = sc.parallelize([i for i in range(1, 100000)]) b = sc.parallelize([i for i in range(1, 100000)]) c = a.zip(b) >>> [(1, 1), (2, 2), -----] sum =…
user2959723 • 651 • 2 • 7 • 13
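
The warning typically appears because sc.parallelize() ships the driver-side collection inside the serialized tasks; generating the range on the executors avoids sending the data at all. A sketch of the same zip-and-sum pipeline in Scala, under that assumption:

```scala
import org.apache.spark.sql.SparkSession

object TaskSizeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("task-size-sketch").getOrCreate()
    val sc = spark.sparkContext

    // sc.range generates the numbers on the executors instead of serializing
    // a local list into each task, which is what triggers the size warning.
    val a = sc.range(1, 100000)
    val b = sc.range(1, 100000)

    val total = a.zip(b).map { case (x, y) => x + y }.sum()
    println(total)

    spark.stop()
  }
}
```
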
42 votes • 2 answers

build.sbt: how to add spark dependencies

Hello, I am trying to pull in spark-core, spark-streaming, twitter4j, and spark-streaming-twitter through the build.sbt file below: name := "hello" version := "1.0" scalaVersion := "2.11.8" libraryDependencies += "org.apache.spark" %% "spark-core" %…
Bobby • 537 • 1 • 5 • 14
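
A hedged build.sbt sketch for that combination. The version numbers are illustrative and must agree with each other; the %% operator appends the Scala binary version, so the Spark artifacts must be published for the declared scalaVersion, and the Twitter receiver lives in Apache Bahir for Spark 2.x:

```scala
// build.sbt -- versions are illustrative; keep Spark, Scala and Bahir versions in sync
name := "hello"
version := "1.0"
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"              % "2.0.0" % "provided",
  "org.apache.spark" %% "spark-streaming"         % "2.0.0" % "provided",
  // the Twitter receiver moved out of Spark into Apache Bahir after 2.0
  "org.apache.bahir" %% "spark-streaming-twitter" % "2.0.0",
  "org.twitter4j"    %  "twitter4j-core"          % "4.0.4"
)
```
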
39 votes • 6 answers

How can I update a broadcast variable in spark streaming?

I have, I believe, a relatively common use case for spark streaming: I have a stream of objects that I would like to filter based on some reference data. Initially, I thought that this would be a very simple thing to achieve using a Broadcast…
Andrew Stubbs • 4,322 • 3 • 29 • 48
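
Broadcast values themselves are immutable, so the pattern usually suggested is to unpersist the old broadcast and publish a fresh one from the driver whenever the reference data changes (for example at the top of each batch in foreachRDD). A sketch of that idea; RefreshableBroadcast is a hypothetical helper, not a Spark API:

```scala
import scala.reflect.ClassTag

import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast

// Hypothetical wrapper: re-broadcasts reference data on demand.
class RefreshableBroadcast[T: ClassTag](sc: SparkContext, load: () => T) {
  @volatile private var bc: Broadcast[T] = sc.broadcast(load())

  // Read this inside transformations as wrapper.get.value
  def get: Broadcast[T] = bc

  // Call from the driver (e.g. at the start of a batch) to pick up new data.
  def refresh(): Unit = {
    bc.unpersist(blocking = false) // drop the stale copy from the executors
    bc = sc.broadcast(load())      // ship a fresh one
  }
}
```

Driver-side code can call refresh() on whatever schedule fits (every batch, every N minutes), while the filter itself keeps reading get.value.
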
37 votes • 4 answers

How to know what is the reason for ClosedChannelExceptions with spark-shell in YARN client mode?

I have been trying to run spark-shell in YARN client mode, but I am getting a lot of ClosedChannelException errors. I am using the Spark 2.0.0 build for Hadoop 2.6. Here are the exceptions: $ spark-2.0.0-bin-hadoop2.6/bin/spark-shell --master yarn…
aks • 1,019 • 1 • 9 • 17
37 votes • 8 answers

How to write spark streaming DF to Kafka topic

I am using Spark Streaming to process data between two Kafka queues, but I cannot seem to find a good way to write to Kafka from Spark. I have tried this: input.foreachRDD(rdd => rdd.foreachPartition(partition => partition.foreach { case…
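
Before the structured Kafka sink existed, the usual answer took exactly this shape: create the producer inside foreachPartition (producers are not serializable, so they cannot be built on the driver) and send each record from there. A sketch under that assumption; the broker address and topic name are placeholders:

```scala
import java.util.Properties

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.streaming.dstream.DStream

object KafkaSinkSketch {
  def writeToKafka(output: DStream[String]): Unit = {
    output.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        // One producer per partition, built on the executor.
        val props = new Properties()
        props.put("bootstrap.servers", "localhost:9092") // placeholder broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        val producer = new KafkaProducer[String, String](props)

        records.foreach { msg =>
          producer.send(new ProducerRecord[String, String]("output-topic", msg)) // placeholder topic
        }
        producer.close()
      }
    }
  }
}
```
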
36 votes • 6 answers

Spark DataFrame: does groupBy after orderBy maintain that order?

I have a Spark 2.0 dataframe example with the following structure: id, hour, count id1, 0, 12 id1, 1, 55 .. id1, 23, 44 id2, 0, 12 id2, 1, 89 .. id2, 23, 34 etc. It contains 24 entries for each id (one for each hour of the day) and is ordered by…
Ana Todor • 781 • 1 • 6 • 15
35 votes • 1 answer

The value of "spark.yarn.executor.memoryOverhead" setting?

Should the value of spark.yarn.executor.memoryOverhead in a Spark job on YARN be allocated to the app, or just set to the max value?
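
spark.yarn.executor.memoryOverhead is a per-executor amount of off-heap memory (Netty buffers, thread stacks, interned strings, and so on) that YARN adds on top of spark.executor.memory when sizing the container; the container request is roughly executor memory plus the overhead. In recent 1.x/2.x versions the default is about 10% of executor memory with a 384 MB floor. An illustrative sketch, with made-up numbers:

```scala
import org.apache.spark.SparkConf

// Illustrative numbers only. Each YARN container for an executor is requested at
// roughly spark.executor.memory + spark.yarn.executor.memoryOverhead.
val conf = new SparkConf()
  .set("spark.executor.memory", "8g")
  // Raise the overhead explicitly when containers are killed for exceeding memory limits.
  .set("spark.yarn.executor.memoryOverhead", "1024") // in MB
```
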
33 votes • 5 answers

Queries with streaming sources must be executed with writeStream.start();

I'm trying to read messages from Kafka (version 10) in Spark and print them. import spark.implicits._ val spark = SparkSession .builder .appName("StructuredNetworkWordCount") …
shivali • 437 • 1 • 6 • 13
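
The error means a streaming Dataset was used with a batch-style action such as show(), count(), or collect(); a streaming query only runs once it is started with writeStream…start() and kept alive. A minimal sketch, assuming the spark-sql-kafka-0-10 package is on the classpath and using placeholder broker and topic names:

```scala
import org.apache.spark.sql.SparkSession

object StructuredKafkaSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("StructuredNetworkWordCount").getOrCreate()
    import spark.implicits._

    val lines = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // placeholder
      .option("subscribe", "input-topic")                  // placeholder
      .load()
      .selectExpr("CAST(value AS STRING)")
      .as[String]

    // show()/collect() on a streaming Dataset throws the error in the title;
    // instead, start a streaming query and let it run.
    val query = lines.writeStream
      .outputMode("append")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```
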
33 votes • 1 answer

Use Spring together with Spark

I'm developing a Spark application and I'm used to Spring as a dependency injection framework. Now I'm stuck with the problem that the processing part uses the @Autowired functionality of Spring, but it is serialized and deserialized by Spark. So…
itsme • 852 • 1 • 10 • 23
32 votes • 4 answers

How do I stop a spark streaming job?

I have a Spark Streaming job which has been running continuously. How do I stop the job gracefully? I have read the usual recommendations of attaching a shutdown hook in the job monitoring and sending a SIGTERM to the job. sys.ShutdownHookThread { …
Saket • 3,079 • 3 • 29 • 48
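
Two mechanisms usually come up in answers: set spark.streaming.stopGracefullyOnShutdown so that Spark's own shutdown hook drains in-flight batches on SIGTERM, or call StreamingContext.stop with stopGracefully = true from your own monitoring code. A sketch of both, with a made-up batch interval:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object GracefulStopSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("graceful-stop-sketch")
      // On SIGTERM, finish the batches already received before shutting down.
      .set("spark.streaming.stopGracefullyOnShutdown", "true")

    val ssc = new StreamingContext(conf, Seconds(10))
    // ... define DStreams, then call ssc.start() and ssc.awaitTermination() ...

    // Explicit programmatic stop (e.g. triggered by an external signal or marker file):
    ssc.stop(stopSparkContext = true, stopGracefully = true)
  }
}
```
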
29 votes • 1 answer

What's the meaning of DStream.foreachRDD function?

In Spark Streaming, every batch interval of data always generates one and only one RDD, so why do we use foreachRDD() to iterate over RDDs? There is only one RDD, so a foreach shouldn't be needed. In my testing, I have never seen more than one RDD.
Guo • 1,761 • 2 • 22 • 45
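
Within one batch there is indeed only one RDD; foreachRDD is not a loop over many RDDs at once, it registers an output action that the scheduler runs once per batch interval, each time on that interval's single RDD. A small sketch illustrating this (the socket source and port are placeholders):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ForeachRDDSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("foreachRDD-sketch")
    val ssc = new StreamingContext(conf, Seconds(5))

    val lines = ssc.socketTextStream("localhost", 9999) // placeholder source

    // Runs once per 5-second batch, on that batch's single RDD.
    lines.foreachRDD { (rdd, time) =>
      println(s"batch at $time contains ${rdd.count()} records")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```
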
29 votes • 3 answers

How to specify multiple dependencies using --packages for spark-submit?

I have the following as the command line to start a spark streaming job. spark-submit --class com.biz.test \ --packages \ org.apache.spark:spark-streaming-kafka_2.10:1.3.0 \ …
davidpricedev • 2,107 • 2 • 20 • 34
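
The usual answer is that --packages takes a single comma-separated list of Maven coordinates, with no spaces and no line continuation between --packages and its value. An illustrative invocation; the second coordinate is a placeholder, not taken from the question:

```
# --packages takes one comma-separated list; the second coordinate below is a placeholder
spark-submit --class com.biz.test \
  --packages org.apache.spark:spark-streaming-kafka_2.10:1.3.0,some.group:some-artifact:1.0.0 \
  ...
```
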