I'm trying to decide how best to design a data pipeline that will involve Spark Streaming.
The essential process I imagine is:
- Set up a streaming job that watches a directory via fileStream (this is the consumer; see the sketch after this list)
- Do a bunch of computation elsewhere, which writes data into that directory (this is the producer)
- The streaming job consumes the data as it comes in, performing various actions
- When the producer is done, wait for all the streaming computations to finish, and tear down the streaming job.
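For concreteness, here is roughly what I have in mind for the consumer side. This is an untested sketch: the directory path, app name, and batch interval are placeholders, and I'm using textFileStream, the text-file convenience wrapper around fileStream.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FileStreamConsumer {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("file-stream-consumer")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // Watch a directory for new files written by the producer
    // ("/data/incoming" is just a placeholder path).
    val lines = ssc.textFileStream("/data/incoming")

    // Stand-in for the real per-batch work.
    lines.foreachRDD { rdd =>
      println(s"got ${rdd.count()} records in this batch")
    }

    ssc.start()
    ssc.awaitTermination()  // blocks forever unless something calls ssc.stop()
  }
}
```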
It's that last step (4) that has me confused: I'm not sure how to shut the streaming job down gracefully. Most of the advice I've found boils down to hitting Ctrl-C on the driver, combined with the spark.streaming.stopGracefullyOnShutdown config setting.
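If I understand it correctly, that amounts to something like the following on the consumer side (a minimal sketch; the app name and batch interval are made up), after which you send the driver process Ctrl-C / SIGTERM and Spark finishes the batches it has already received before exiting:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Ask Spark to stop the streaming context gracefully when the driver JVM
// receives a shutdown signal (Ctrl-C / SIGTERM).
val conf = new SparkConf()
  .setAppName("file-stream-consumer")
  .set("spark.streaming.stopGracefullyOnShutdown", "true")
val ssc = new StreamingContext(conf, Seconds(10))
```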
I don't like that approach, since it requires the producing code to somehow reach the consumer's driver and send it a signal. The two systems could be completely unrelated, so that isn't necessarily easy to arrange. Besides, there is already a communication channel (the fileStream itself); can't I use that?
In a traditional threaded producer/consumer situation, one common technique is to use a "poison pill". The producer sends a special piece of data indicating "no more data", then you wait for your consumers to exit.
Is there a reason this can't be done in Spark?
Surely there is a way for the stream processing code, upon seeing some special data, to send a message back to its driver?
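Concretely, something like this is what I'm imagining: an untested sketch building on the consumer above, where the "__DONE__" marker, path, and batch interval are all made up. It leans on the fact that the body of foreachRDD runs on the driver, and polls with awaitTerminationOrTimeout instead of blocking forever in awaitTermination.

```scala
import java.util.concurrent.atomic.AtomicBoolean
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object PoisonPillConsumer {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("poison-pill-consumer")
    val ssc  = new StreamingContext(conf, Seconds(10))
    val done = new AtomicBoolean(false)  // driver-side "we saw the pill" flag

    val lines = ssc.textFileStream("/data/incoming")  // placeholder path

    lines.foreachRDD { rdd =>
      // This block runs on the driver once per batch; only the RDD
      // operations inside it run on the executors.
      // ... normal per-batch processing of `rdd` would go here ...
      if (rdd.filter(_ == "__DONE__").count() > 0) {  // "__DONE__" = poison pill
        done.set(true)
      }
    }

    ssc.start()
    // Poll rather than block forever, so the driver can notice the pill.
    while (!ssc.awaitTerminationOrTimeout(10000)) {
      if (done.get()) {
        // Graceful stop: finish the batches already received, then exit.
        ssc.stop(stopSparkContext = true, stopGracefully = true)
      }
    }
  }
}
```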
The Spark docs have an example of listening on a socket with socketTextStream, and it is somehow able to terminate when the producer is done. I haven't dug into that code yet, but this seems like it should be possible.
Any advice? Is this fundamentally wrong-headed?