7

I have a Spark Structured Streaming job, it reads the offsets from a Kafka topic and writes it to the aerospike database. Currently I am in the process making this job production ready and implementing the SparkListener.

While going to through the documentation I stumbled upon this example:

StreamingQuery query = wordCounts.writeStream()
    .outputMode("complete")
    .format("console")
    .start();
query.awaitTermination();

After this code is executed, the streaming computation will have
started in the background. The query object is a handle to that active
streaming query, and we have decided to wait for the termination of
the query using awaitTermination() to prevent the process from exiting
while the query is active.

I understand that it waits for query to complete before terminating the process.

What does it mean exactly? It helps to avoid data loss written by the query.

How is it helpful when query is writing millions of records every day?

My code looks pretty simple though:

dataset.writeStream()
  .option("startingOffsets", "earliest")
  .outputMode(OutputMode.Append())
  .format("console")
  .foreach(sink)
  .trigger(Trigger.ProcessingTime(triggerInterval))
  .option("checkpointLocation", checkpointLocation)
  .start();
Shaido
  • 27,497
  • 23
  • 70
  • 73
Himanshu Yadav
  • 13,315
  • 46
  • 162
  • 291

2 Answers2

14

There are quite a few questions here, but answering just the one below should answer all.

I understand that it waits for query to complete before terminating the process. What does it mean exactly?

A streaming query runs in a separate daemon thread. In Java, daemon threads are used to allow for parallel processing until the main thread of your Spark application finishes (dies). Right after the last non-daemon thread finishes, the JVM shuts down and the entire Spark application finishes.

That's why you need to keep the main non-daemon thread waiting for the other daemon threads so they can do their work.

Read up on daemon threads in What is a daemon thread in Java?

Jacek Laskowski
  • 72,696
  • 27
  • 242
  • 420
2

I understand that it waits for query to complete before terminating the process. What does it mean exactly

Nothing more, nothing less. Since query is started in background, without explicit blocking instruction your code would simply reach the end of main function and exit immediately.

How is it helpful when query is writing millions of records every day?

It really doesn't. It instead ensure that query is execute at all.