I've been facing a problem with Spark Streaming concerning the insertion of an output DStream into a permanent SQL table. I'd like to insert the output of every DStream (i.e., of each batch that Spark processes) into a single permanent table. I'm using Python with Spark 1.6.2.
At this point in my code I have a DStream made of one or more RDDs that I'd like to permanently insert/store into a SQL table, without losing any result from any processed batch.
rr = feature_and_label.join(result_zipped) \
    .map(lambda x: (x[1][0][0], x[1][1]))
Each record here is a tuple, for instance (4.0, 0). I can't use SparkSQL directly because of the way Spark treats the table: it is only a temporary table, so the results are lost at every batch.
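For context, this is roughly the temp-table approach that doesn't work for me (a minimal sketch; the table name "results", the column names, and the save_batch helper are just placeholders I made up):

from pyspark.sql import SQLContext

def save_batch(time, rdd):
    # Skip empty micro-batches.
    if rdd.isEmpty():
        return
    sql_context = SQLContext(rdd.context)
    df = sql_context.createDataFrame(rdd, ["label", "prediction"])
    # Temporary table: it is replaced on every batch and lives only
    # for the duration of the application, so earlier results are lost.
    df.registerTempTable("results")

rr.foreachRDD(save_batch)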
This is an example of output:
Time: 2016-09-23 00:57:00
(0.0, 2)
Time: 2016-09-23 00:57:01
(4.0, 0)
Time: 2016-09-23 00:57:02
(4.0, 0)
...
As shown above, each batch produces only a single record. As I said before, I'd like to permanently store these results into a table saved somewhere, and possibly query it at a later time. So my question is: is there a way to do this?
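For what it's worth, the direction I've been considering (just a sketch, and I don't know whether it behaves correctly per batch; the output path /tmp/streaming_results and the append_batch helper are placeholders) is to append each batch to a Parquet file from foreachRDD, then query the accumulated file later with SparkSQL:

from pyspark.sql import SQLContext

def append_batch(time, rdd):
    # Skip empty micro-batches so we don't write empty files.
    if rdd.isEmpty():
        return
    sql_context = SQLContext(rdd.context)
    df = sql_context.createDataFrame(rdd, ["label", "prediction"])
    # Append mode should accumulate results across batches
    # instead of overwriting them.
    df.write.mode("append").parquet("/tmp/streaming_results")

rr.foreachRDD(append_batch)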
I'd appreciate it if somebody could help me out with this, but especially tell me whether it is possible at all.
Thank you.