I am following along with the code in *Spark: The Definitive Guide*. I ran into an issue where the following code does not print results in my Jupyter notebook when the commented-out line `activityQuery.awaitTermination()` is included. With `awaitTermination()` in the code, the Jupyter kernel goes busy and stays busy for a long time, possibly indefinitely.
Without `awaitTermination()` the code works fine.
Can someone explain this behavior, and how can I overcome it?
static = spark.read.json(r"/resources/activity-data/")
dataSchema = static.schema
streaming = (spark
    .readStream
    .schema(dataSchema)
    .option("maxFilesPerTrigger", 1)
    .json(r"/resources/activity-data/")
)
activityCounts = streaming.groupBy("gt").count()
spark.conf.set("spark.sql.shuffle.partitions", 5)
activityQuery = (activityCounts
    .writeStream
    .queryName("activity_counts")
    .format("memory")
    .outputMode("complete")
    .start()
)
#activityQuery.awaitTermination()
#activityQuery.stop()
from time import sleep
for x in range(5):
    spark.table("activity_counts").show()
    sleep(1)
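For context on what I think is happening: `awaitTermination()` blocks the calling (driver) thread until the streaming query stops, so in a notebook the kernel stays busy and the cells after it never run. A minimal pure-Python analogy of the same blocking behavior, using `threading` instead of Spark (the names here are made up for illustration):

```python
import threading
import time

def background_query():
    # stands in for a long-running streaming query
    time.sleep(0.5)

t = threading.Thread(target=background_query)
t.start()

# t.join() with no timeout blocks until the thread ends, just as
# awaitTermination() blocks until the streaming query terminates --
# in a notebook this keeps the kernel busy.
still_running = t.is_alive()  # True: the "query" has not terminated yet

t.join()                      # block here until it really finishes
done = not t.is_alive()       # True once it has terminated
print(still_running, done)
```

If this analogy holds, leaving out `awaitTermination()` in the notebook (or calling it with a timeout) would let the later cells run while the query keeps going in the background.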