
I just began exploring Apache Spark (following A Gentle Introduction to Apache Spark) on Windows 10, using pyspark. I got to the chapter on Structured Streaming, and I'm having trouble with cmd: whenever I start a stream, the cmd window becomes unusable, because Spark keeps printing progress output, so anything I type quickly scrolls out of view.

My code (taken directly from the book):

from pyspark.sql.functions import window, column, desc, col

# static read, used only to infer the schema
staticDataFrame = spark.read.format("csv")\
    .option("header", "true")\
    .option("inferSchema", "true")\
    .load("./data/retail-data/by-day/*.csv")

staticSchema = staticDataFrame.schema

# streaming read over the same files, one file per trigger
streamingDataFrame = spark.readStream\
    .schema(staticSchema)\
    .option("maxFilesPerTrigger", 1)\
    .format("csv")\
    .option("header", "true")\
    .load("./data/retail-data/by-day/*.csv")

# total spend per customer per one-day window
purchaseByCustomerPerHour = streamingDataFrame\
    .selectExpr(
        "CustomerId",
        "(UnitPrice * Quantity) as total_cost",
        "InvoiceDate")\
    .groupBy(
        col("CustomerId"), window(col("InvoiceDate"), "1 day"))\
    .sum("total_cost")

# stream the running aggregation into an in-memory table
purchaseByCustomerPerHour.writeStream\
    .format("memory")\
    .option("checkpointLocation", "F:/Spark/sparktmp")\
    .queryName("customer_purchases")\
    .outputMode("complete")\
    .start()
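As an aside, start() returns a pyspark.sql.streaming.StreamingQuery handle; keeping it (the variable name query is my own) would at least give me a way to inspect or stop the stream from code:

# same writer as above, but keeping the handle that start() returns
query = purchaseByCustomerPerHour.writeStream\
    .format("memory")\
    .option("checkpointLocation", "F:/Spark/sparktmp")\
    .queryName("customer_purchases")\
    .outputMode("complete")\
    .start()

query.isActive  # True while the stream is running
query.stop()    # stops the streaming query cleanly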

The issue I'm talking about:

[screenshot: "My issue with cmd" — the cmd window flooded with Spark stage-progress output]

The caret should be at the line where [Stage 6:======>] is. So if I wanted to query the stream (as the book suggests), I can't. And I can't just open a second pyspark shell, since that would be a different Spark session. I'm also not sure whether the streaming job should start over once it has exhausted all the input files (which it does), but I guess that's a topic for a different question.
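For reference, the kind of query I'd like to run against the in-memory table is a sketch along the book's lines; it assumes the stream has started populating customer_purchases, and that sum(total_cost) is the column name produced by .sum("total_cost"):

# read the running aggregation out of the memory sink
spark.sql("""
    SELECT *
    FROM customer_purchases
    ORDER BY `sum(total_cost)` DESC
""").show(5)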

Let me know if I should provide more information. Thank you in advance!


1 Answer


TL;DR Just use a notebook environment. Jupyter Notebook (optionally with the Apache Toree kernel) or Apache Zeppelin will work just fine: they don't capture the console output (which can itself be a drawback in other scenarios; see How to get the output from console streaming sink in Zeppelin?), and they let you run queries without interruption.

In a standard shell, setting spark.ui.showConsoleProgress to false can also help a bit:

bin/pyspark --conf "spark.ui.showConsoleProgress=false"
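
The same option can also be set when the session is constructed, in case you build it yourself rather than relying on the shell's spark object (a sketch; the app name is arbitrary):

from pyspark.sql import SparkSession

# suppress the console progress bar for this session
spark = SparkSession.builder\
    .appName("streaming-intro")\
    .config("spark.ui.showConsoleProgress", "false")\
    .getOrCreate()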