Questions tagged [spark-structured-streaming]

Spark Structured Streaming allows processing live data streams using DataFrame and Dataset APIs.

Spark Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing built on the Dataset/DataFrame APIs, available in Scala, Java, Python, and R. Structured Streaming was introduced in Spark 2.x and is not to be confused with the older DStream-based Spark Streaming API from Spark 1.x.
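
As a quick illustration of the API this tag covers, a minimal streaming word count over a socket source might look like the sketch below (assumes Spark 2.x on the classpath and a text server on `localhost:9999`, e.g. `nc -lk 9999`; it needs a Spark runtime to actually run):

```scala
import org.apache.spark.sql.SparkSession

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("structured-word-count")
      .getOrCreate()
    import spark.implicits._

    // A streaming DataFrame: one row per line received on the socket.
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // The same Dataset/DataFrame operators as in batch queries.
    val counts = lines.as[String]
      .flatMap(_.split("\\s+"))
      .groupBy("value")
      .count()

    counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()
      .awaitTermination()
  }
}
```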


2360 questions
31 votes · 11 answers

Integrating Spark Structured Streaming with the Confluent Schema Registry

I'm using a Kafka source in Spark Structured Streaming to receive Confluent-encoded Avro records. I intend to use the Confluent Schema Registry, but integrating it with Spark Structured Streaming seems to be impossible. I have seen this question, but…
30 votes · 3 answers

How to get Kafka offsets for structured query for manual and reliable offset management?

Spark 2.2 introduced a Kafka structured streaming source. As I understand it, it relies on the HDFS checkpoint directory to store offsets and guarantee "exactly-once" message delivery. But old docs (like…
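
For context, the two knobs involved are the `checkpointLocation` (where the source persists its progress) and `startingOffsets` (which only applies on the very first run, before a checkpoint exists). A sketch, assuming an existing `SparkSession` named `spark` and a broker/topic that are placeholders:

```scala
// Sketch: the Kafka source persists its progress in the checkpoint directory;
// startingOffsets only takes effect when no checkpoint exists yet.
val kafkaStream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .option("startingOffsets", "earliest")
  .load()

kafkaStream.writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/checkpoints/events")  // offsets stored here
  .start()
```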
28 votes · 8 answers

Why does format("kafka") fail with "Failed to find data source: kafka." (even with uber-jar)?

I use HDP-2.6.3.0 with the Spark2 package 2.2.0. I'm trying to write a Kafka consumer using the Structured Streaming API, but I'm getting the following error after submitting the job to the cluster: Exception in thread "main"…
26 votes · 8 answers

Why does Spark application fail with “ClassNotFoundException: Failed to find data source: kafka” as uber-jar with sbt assembly?

I'm trying to run a sample like StructuredKafkaWordCount. I started with the Spark Structured Streaming Programming guide. My code is package io.boontadata.spark.job1 import org.apache.spark.sql.SparkSession object DirectKafkaAggregateEvents { …
benjguin • 1,496 • 1 • 12 • 21
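
Both "Failed to find data source: kafka" questions above come down to the same packaging issue: the Kafka source ships in a separate artifact that is not part of spark-core/spark-sql. A build.sbt fragment (the version number is an assumption; match it to your Spark version, or pass the same coordinate via `--packages` to spark-submit):

```scala
// build.sbt (fragment): the Kafka source is a separate artifact and must be
// declared explicitly; spark-sql alone does not include it.
libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.2.0"
```

When building an uber-jar with sbt-assembly, also make sure the merge strategy concatenates `META-INF/services/*` files rather than discarding them, since the data source is discovered through `java.util.ServiceLoader` and a clobbered service file reproduces the same error even with the dependency present.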
25 votes · 4 answers

Spark Structured Streaming automatically converts timestamp to local time

I have my timestamp in UTC and ISO 8601, but using Structured Streaming, it gets automatically converted into local time. Is there a way to stop this conversion? I would like to have it in UTC. I'm reading JSON data from Kafka and then parsing…
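
A workaround commonly suggested for this is pinning the session time zone to UTC so Spark SQL neither interprets nor renders timestamps in the JVM's local zone. A configuration sketch:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: force Spark SQL / Structured Streaming to use UTC for
// timestamp parsing and rendering instead of the JVM's local time zone.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("utc-timestamps")
  .config("spark.sql.session.timeZone", "UTC")  // session-wide setting
  .getOrCreate()
```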
22 votes · 2 answers

Why does Complete output mode require aggregation?

I work with the latest Structured Streaming in Apache Spark 2.2 and got the following exception: org.apache.spark.sql.AnalysisException: Complete output mode not supported when there are no streaming aggregations on streaming …
Jacek Laskowski • 72,696 • 27 • 242 • 420
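
The intuition behind the exception: Complete mode re-emits the entire result table on every trigger, which is only bounded (and therefore supported) when the result is an aggregation. A sketch of a query Complete mode does accept, assuming an existing streaming DataFrame `events` with a `userId` column:

```scala
// Complete mode requires a streaming aggregation: the whole (small,
// aggregated) result table is rewritten to the sink on every trigger.
val countsByUser = events.groupBy("userId").count()

countsByUser.writeStream
  .outputMode("complete")  // without the groupBy above, this throws AnalysisException
  .format("console")
  .start()
```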
21 votes · 8 answers

Multiple aggregations in Spark Structured Streaming

I would like to do multiple aggregations in Spark Structured Streaming. Something like this: Read a stream of input files (from a folder) Perform aggregation 1 (with some transformations) Perform aggregation 2 (and more transformations) When I…
18 votes · 2 answers

How to start multiple streaming queries in a single Spark application?

I have built a few Spark Structured Streaming queries to run on EMR. They are long-running queries and need to run at all times, since they are all ETL-type queries. When I submit a job to the YARN cluster on EMR, I can submit a single Spark application.…
N_C • 952 • 8 • 17
17 votes · 2 answers

Spark structured streaming app reading from multiple Kafka topics

I have a Spark Structured Streaming app (v2.3.2) which needs to read from a number of Kafka topics, do some relatively simple processing (mainly aggregations and a few joins) and publish the results to a number of other Kafka topics. So multiple…
jammann • 690 • 1 • 6 • 22
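
For reference, the Kafka source accepts several topics in a single stream, so one read can cover them all. A sketch assuming an existing `SparkSession` named `spark` (broker address and topic names are placeholders):

```scala
import spark.implicits._

// One streaming DataFrame over several topics; the `topic` column
// tells you which topic each record came from.
val multi = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "orders,payments,shipments")  // comma-separated list
  .load()
  .select($"topic", $"key".cast("string"), $"value".cast("string"))
```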
17 votes · 5 answers

Spark structured streaming kafka convert JSON without schema (infer schema)

I read that Spark Structured Streaming doesn't support schema inference for reading Kafka messages as JSON. Is there a way to retrieve the schema the same way Spark Streaming does: val dataFrame = spark.read.json(rdd.map(_.value())) dataFrame.printSchema
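
A workaround several answers to this kind of question suggest: infer the schema once from a static sample of the same JSON payloads, then apply it to the stream with `from_json`. A sketch (the sample path, broker, and topic are assumptions; `spark` is an existing `SparkSession`):

```scala
import org.apache.spark.sql.functions.from_json
import spark.implicits._

// 1) Infer the schema from a static sample of the same JSON payloads.
val sampleSchema = spark.read.json("/data/sample-events.json").schema

// 2) Apply it to the Kafka value column on the stream.
val parsed = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()
  .select(from_json($"value".cast("string"), sampleSchema).as("event"))
  .select("event.*")
```

The trade-off is that records whose shape drifts from the sampled schema parse to nulls, so the sample must stay representative.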
16 votes · 1 answer

How to load streaming data from Amazon SQS?

I use Spark 2.2.0. How can I feed an Amazon SQS stream to a Spark structured stream using PySpark? This question tries to answer it for non-structured streaming and for Scala by creating a custom receiver. Is something similar possible in PySpark?…
15 votes · 2 answers

How to get the output from console streaming sink in Zeppelin?

I'm struggling to get the console sink working with PySpark Structured Streaming when run from Zeppelin. Basically, I'm not seeing any results printed to the screen, or to any logfiles I've found. My question: Does anyone have a working example of…
m01 • 9,033 • 6 • 32 • 58
15 votes · 4 answers

How to read streaming dataset once and output to multiple sinks?

I have a Spark Structured Streaming job that reads from S3, transforms the data and then stores it to one S3 sink and one Elasticsearch sink. Currently, I am doing readStream once and then writeStream.format("").start() twice. When doing so, it seems…
s11230 • 151 • 1 • 4
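
Since Spark 2.4, `foreachBatch` is the usual answer here: one streaming read whose micro-batches are fanned out to several sinks. A sketch (paths, the predefined `eventSchema`, and the Elasticsearch `"es"` format from elasticsearch-hadoop are all assumptions):

```scala
// Spark 2.4+: one streaming read, fan-out to two sinks per micro-batch.
// Streaming file sources need an explicit schema (eventSchema is assumed).
val stream = spark.readStream.schema(eventSchema).json("s3://bucket/in/")

stream.writeStream
  .foreachBatch { (batch: org.apache.spark.sql.DataFrame, batchId: Long) =>
    batch.persist()                                         // read each batch only once
    batch.write.mode("append").parquet("s3://bucket/out/")  // sink 1: S3
    batch.write.format("es").save("events/doc")             // sink 2: Elasticsearch
    batch.unpersist()
  }
  .option("checkpointLocation", "/tmp/checkpoints/fanout")
  .start()
```

Without `persist()`, each `write` may recompute the batch from the source, which is what two independent `writeStream.start()` calls effectively do.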
15 votes · 1 answer

Executing separate streaming queries in spark structured streaming

I am trying to aggregate a stream with two different windows and print it to the console. However, only the first streaming query is being printed. The tenSecsQ is not printed to the console. SparkSession spark = SparkSession .builder() …
atom • 153 • 1 • 5
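
The usual cause of only the first query producing output: calling `awaitTermination()` on the first query before the second one is ever started. Start every query first, then block on the `StreamingQueryManager`. A sketch, assuming two already-built aggregations `fiveSecCounts` and `tenSecCounts`:

```scala
// Start ALL queries before blocking; awaitTermination() on the first query
// would prevent the second start() from ever being reached.
val fiveSecsQ = fiveSecCounts.writeStream.format("console").start()
val tenSecsQ  = tenSecCounts.writeStream.format("console").start()

// Block until any active query in this session terminates.
spark.streams.awaitAnyTermination()
```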
15 votes · 2 answers

Empty output for Watermarked Aggregation Query in Append Mode

I use Spark 2.2.0-rc1. I've got a Kafka topic which I'm querying with a running watermarked aggregation, with a 1-minute watermark, outputting to the console in append output mode. import org.apache.spark.sql.types._ val schema =…
himanshuIIITian • 5,985 • 6 • 50 • 70
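
The empty console is usually expected behavior rather than a bug: in append mode, a windowed aggregate row is emitted only once, after the watermark (max event time seen, minus the delay) moves past the end of that window. A sketch of such a query, assuming a streaming DataFrame `events` with an `eventTime` timestamp column:

```scala
import org.apache.spark.sql.functions.window
import spark.implicits._

// In append mode, each window's row is emitted exactly once, after the
// watermark (max event time minus 1 minute) passes the window's end --
// so the console can legitimately stay empty for a while.
val counts = events
  .withWatermark("eventTime", "1 minute")
  .groupBy(window($"eventTime", "1 minute"))
  .count()

counts.writeStream
  .outputMode("append")
  .format("console")
  .start()
```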