Questions tagged [spark-structured-streaming]

Spark Structured Streaming allows processing live data streams using DataFrame and Dataset APIs.

Spark Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing built on the Dataset/DataFrame APIs, available in Scala, Java, Python, and R. Structured Streaming was introduced in Spark 2.x and is not to be confused with the older DStream-based Spark Streaming API from Spark 1.x.
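
As a quick illustration of the API this tag covers, a minimal streaming word count over a socket source might look like the sketch below (assumes Spark 2.x on the classpath and a text server on `localhost:9999`, e.g. `nc -lk 9999`; it needs a Spark runtime to actually run):

```scala
import org.apache.spark.sql.SparkSession

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("structured-word-count")
      .getOrCreate()
    import spark.implicits._

    // A streaming DataFrame: one row per line received on the socket.
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // The same Dataset/DataFrame operators as in batch queries.
    val counts = lines.as[String]
      .flatMap(_.split("\\s+"))
      .groupBy("value")
      .count()

    counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()
      .awaitTermination()
  }
}
```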


2360 questions
31 votes · 11 answers

Integrating Spark Structured Streaming with the Confluent Schema Registry

I'm using a Kafka source in Spark Structured Streaming to receive Confluent-encoded Avro records. I intend to use the Confluent Schema Registry, but integrating it with Spark Structured Streaming seems to be impossible. I have seen this question, but…
30 votes · 3 answers

How to get Kafka offsets for structured query for manual and reliable offset management?

Spark 2.2 introduced a Kafka structured streaming source. As I understand it, it relies on the HDFS checkpoint directory to store offsets and guarantee "exactly-once" message delivery. But old docs (like…
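
For context, the two knobs involved are the `checkpointLocation` (where the source persists its progress) and `startingOffsets` (which only applies on the very first run, before a checkpoint exists). A sketch, assuming an existing `SparkSession` named `spark` and a broker/topic that are placeholders:

```scala
// Sketch: the Kafka source persists its progress in the checkpoint directory;
// startingOffsets only takes effect when no checkpoint exists yet.
val kafkaStream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .option("startingOffsets", "earliest")
  .load()

kafkaStream.writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/checkpoints/events")  // offsets stored here
  .start()
```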
28 votes · 8 answers

Why does format("kafka") fail with "Failed to find data source: kafka." (even with uber-jar)?

I use HDP-2.6.3.0 with the Spark2 package 2.2.0. I'm trying to write a Kafka consumer using the Structured Streaming API, but I'm getting the following error after submitting the job to the cluster: Exception in thread "main"…
26 votes · 8 answers

Why does Spark application fail with “ClassNotFoundException: Failed to find data source: kafka” as uber-jar with sbt assembly?

I'm trying to run a sample like StructuredKafkaWordCount. I started with the Spark Structured Streaming Programming guide. My code is package io.boontadata.spark.job1 import org.apache.spark.sql.SparkSession object DirectKafkaAggregateEvents { …
benjguin • 1,496 • 1 • 12 • 21
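
Both "Failed to find data source: kafka" questions above come down to the same packaging issue: the Kafka source ships in a separate artifact that is not part of spark-core/spark-sql. A build.sbt fragment (the version number is an assumption; match it to your Spark version, or pass the same coordinate via `--packages` to spark-submit):

```scala
// build.sbt (fragment): the Kafka source is a separate artifact and must be
// declared explicitly; spark-sql alone does not include it.
libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.2.0"
```

When building an uber-jar with sbt-assembly, also make sure the merge strategy concatenates `META-INF/services/*` files rather than discarding them, since the data source is discovered through `java.util.ServiceLoader` and a clobbered service file reproduces the same error even with the dependency present.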
25 votes · 4 answers

Spark Structured Streaming automatically converts timestamp to local time

I have my timestamp in UTC and ISO 8601, but using Structured Streaming, it gets automatically converted into local time. Is there a way to stop this conversion? I would like to have it in UTC. I'm reading JSON data from Kafka and then parsing…
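
A workaround commonly suggested for this is pinning the session time zone to UTC so Spark SQL neither interprets nor renders timestamps in the JVM's local zone. A configuration sketch:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: force Spark SQL / Structured Streaming to use UTC for
// timestamp parsing and rendering instead of the JVM's local time zone.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("utc-timestamps")
  .config("spark.sql.session.timeZone", "UTC")  // session-wide setting
  .getOrCreate()
```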
22 votes · 2 answers

Why does Complete output mode require aggregation?

I work with the latest Structured Streaming in Apache Spark 2.2 and got the following exception: org.apache.spark.sql.AnalysisException: Complete output mode not supported when there are no streaming aggregations on streaming …
Jacek Laskowski • 72,696 • 27 • 242 • 420
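
The intuition behind the exception: Complete mode re-emits the entire result table on every trigger, which is only bounded (and therefore supported) when the result is an aggregation. A sketch of a query Complete mode does accept, assuming an existing streaming DataFrame `events` with a `userId` column:

```scala
// Complete mode requires a streaming aggregation: the whole (small,
// aggregated) result table is rewritten to the sink on every trigger.
val countsByUser = events.groupBy("userId").count()

countsByUser.writeStream
  .outputMode("complete")  // without the groupBy above, this throws AnalysisException
  .format("console")
  .start()
```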
21 votes · 8 answers

Multiple aggregations in Spark Structured Streaming

I would like to do multiple aggregations in Spark Structured Streaming. Something like this: Read a stream of input files (from a folder) Perform aggregation 1 (with some transformations) Perform aggregation 2 (and more transformations) When I…
18 votes · 2 answers

How to start multiple streaming queries in a single Spark application?

I have built a few Spark Structured Streaming queries to run on EMR. They are long-running queries and need to run at all times, since they are all ETL-type queries. When I submit a job to the YARN cluster on EMR, I can submit a single Spark application.…
N_C • 952 • 8 • 17
17 votes · 2 answers

Spark structured streaming app reading from multiple Kafka topics

I have a Spark Structured Streaming app (v2.3.2) which needs to read from a number of Kafka topics, do some relatively simple processing (mainly aggregations and a few joins) and publish the results to a number of other Kafka topics. So multiple…
jammann • 690 • 1 • 6 • 22
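
For reference, the Kafka source accepts several topics in a single stream, so one read can cover them all. A sketch assuming an existing `SparkSession` named `spark` (broker address and topic names are placeholders):

```scala
import spark.implicits._

// One streaming DataFrame over several topics; the `topic` column
// tells you which topic each record came from.
val multi = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "orders,payments,shipments")  // comma-separated list
  .load()
  .select($"topic", $"key".cast("string"), $"value".cast("string"))
```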
17 votes · 5 answers

Spark structured streaming kafka convert JSON without schema (infer schema)

I read that Spark Structured Streaming doesn't support schema inference for reading Kafka messages as JSON. Is there a way to retrieve the schema the same way Spark Streaming does: val dataFrame = spark.read.json(rdd.map(_.value())) dataFrame.printSchema
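
A workaround several answers to this kind of question suggest: infer the schema once from a static sample of the same JSON payloads, then apply it to the stream with `from_json`. A sketch (the sample path, broker, and topic are assumptions; `spark` is an existing `SparkSession`):

```scala
import org.apache.spark.sql.functions.from_json
import spark.implicits._

// 1) Infer the schema from a static sample of the same JSON payloads.
val sampleSchema = spark.read.json("/data/sample-events.json").schema

// 2) Apply it to the Kafka value column on the stream.
val parsed = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()
  .select(from_json($"value".cast("string"), sampleSchema).as("event"))
  .select("event.*")
```

The trade-off is that records whose shape drifts from the sampled schema parse to nulls, so the sample must stay representative.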
16 votes · 1 answer

How to load streaming data from Amazon SQS?

I use Spark 2.2.0. How can I feed an Amazon SQS stream to a Spark structured stream using PySpark? This question tries to answer it for non-structured streaming and for Scala by creating a custom receiver. Is something similar possible in PySpark?…
15 votes · 2 answers

How to get the output from console streaming sink in Zeppelin?

I'm struggling to get the console sink working with PySpark Structured Streaming when run from Zeppelin. Basically, I'm not seeing any results printed to the screen, or to any logfiles I've found. My question: Does anyone have a working example of…
m01 • 9,033 • 6 • 32 • 58
15 votes · 4 answers

How to read streaming dataset once and output to multiple sinks?

I have a Spark Structured Streaming job that reads from S3, transforms the data and then stores it to one S3 sink and one Elasticsearch sink. Currently, I am doing readStream once and then writeStream.format("").start() twice. When doing so, it seems…
s11230 • 151 • 1 • 4
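
Since Spark 2.4, `foreachBatch` is the usual answer here: one streaming read whose micro-batches are fanned out to several sinks. A sketch (paths, the predefined `eventSchema`, and the Elasticsearch `"es"` format from elasticsearch-hadoop are all assumptions):

```scala
// Spark 2.4+: one streaming read, fan-out to two sinks per micro-batch.
// Streaming file sources need an explicit schema (eventSchema is assumed).
val stream = spark.readStream.schema(eventSchema).json("s3://bucket/in/")

stream.writeStream
  .foreachBatch { (batch: org.apache.spark.sql.DataFrame, batchId: Long) =>
    batch.persist()                                         // read each batch only once
    batch.write.mode("append").parquet("s3://bucket/out/")  // sink 1: S3
    batch.write.format("es").save("events/doc")             // sink 2: Elasticsearch
    batch.unpersist()
  }
  .option("checkpointLocation", "/tmp/checkpoints/fanout")
  .start()
```

Without `persist()`, each `write` may recompute the batch from the source, which is what two independent `writeStream.start()` calls effectively do.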
15 votes · 1 answer

Executing separate streaming queries in spark structured streaming

I am trying to aggregate a stream with two different windows and print it to the console. However, only the first streaming query is being printed. The tenSecsQ is not printed to the console. SparkSession spark = SparkSession .builder() …
atom • 153 • 1 • 5
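
The usual cause of only the first query producing output: calling `awaitTermination()` on the first query before the second one is ever started. Start every query first, then block on the `StreamingQueryManager`. A sketch, assuming two already-built aggregations `fiveSecCounts` and `tenSecCounts`:

```scala
// Start ALL queries before blocking; awaitTermination() on the first query
// would prevent the second start() from ever being reached.
val fiveSecsQ = fiveSecCounts.writeStream.format("console").start()
val tenSecsQ  = tenSecCounts.writeStream.format("console").start()

// Block until any active query in this session terminates.
spark.streams.awaitAnyTermination()
```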
15 votes · 2 answers

Empty output for Watermarked Aggregation Query in Append Mode

I use Spark 2.2.0-rc1. I've got a Kafka topic which I'm querying with a running watermarked aggregation, with a 1-minute watermark, outputting to the console in append output mode. import org.apache.spark.sql.types._ val schema =…
himanshuIIITian • 5,985 • 6 • 50 • 70
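
The empty console is usually expected behavior rather than a bug: in append mode, a windowed aggregate row is emitted only once, after the watermark (max event time seen, minus the delay) moves past the end of that window. A sketch of such a query, assuming a streaming DataFrame `events` with an `eventTime` timestamp column:

```scala
import org.apache.spark.sql.functions.window
import spark.implicits._

// In append mode, each window's row is emitted exactly once, after the
// watermark (max event time minus 1 minute) passes the window's end --
// so the console can legitimately stay empty for a while.
val counts = events
  .withWatermark("eventTime", "1 minute")
  .groupBy(window($"eventTime", "1 minute"))
  .count()

counts.writeStream
  .outputMode("append")
  .format("console")
  .start()
```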