How to save/insert each DStream into a permanent table

Question

I've been facing a problem with "Spark Streaming" about the insertion of output Dstream into a permanent SQL table. I'd like to insert every output DStream (coming from single batch that spark processes) into a unique table. I've been using Python with a Spark version 1.6.2.

At this part of my code I have a Dstream made of one or more RDD that i'd like to permanently insert/store into a SQL table without losing any result for each processed batch.

rr = feature_and_label.join(result_zipped)\
                      .map(lambda x: (x[1][0][0], x[1][1]) )

Each Dstream here is represented for instance like this tuple: (4.0, 0). I can't use SparkSQL because the way Spark treats the 'table', that is, like a temporary table, therefore loosing the result at every batch.

This is an example of output:

Time: 2016-09-23 00:57:00

(0.0, 2)

Time: 2016-09-23 00:57:01

(4.0, 0)

Time: 2016-09-23 00:57:02

(4.0, 0)

...

As shown above, each batch is made by only one Dstream. As I said before, I'd like to permanently store these results into a table saved somewhere, and possibly querying it at later time. So my question is: is there a way to do it ?
I'd appreciate whether somebody can help me out with it but especially telling me whether it is possible or not. Thank you.

score 7 · Answer 1 · edited Jun 20 '20 at 09:12

Vanilla Spark does not provide a way to persist data unless you've downloaded the version packaged with HDFS (although they appear to be playing with the idea in Spark 2.0). One way to store the results to a permanent table and query those results later is to use one of the various databases in the Spark Database Ecosystem. There are pros and cons to each and your use case matters. I'll provide something close to a master list. These are segmented by:

Type of data managment, form data is stored in, connection to Spark

Database, SQL, Integrated

SnappyData

Database, SQL, Connector

MemSQL
Hana
Kudu
FiloDB
DB2
SQLServer (JDBC)
Oracle (JDBC)
MySQL (JDBC)

Database, NoSQL, Connector

Database, Document, Connector

Database, Graph, Connector

Search, Document, Connector

Data grid, SQL, Connector

Ignite

Data grid, NoSQL, Connector

File System, Files, Integrated

HDFS

File System, Files, Connector

S3
Alluxio

Datawarehouse, SQL, Connector

thanks for the reply and the master list. I solved it in a fast way using MySQL (JDBC). — Davide Nardone, Sep 29 '16 at 08:00
Just one problem with DB2. It was made for Spark 1.6 . Unfortunately the world is moving on to Spark 2, and there's no compatible replacements for this available — Sparker0i, May 07 '19 at 20:45

score 0 · Answer 2 · answered Dec 27 '17 at 07:01

0

Instead of using external connectors better go for spark structured streaming .

answered Dec 27 '17 at 07:01

hadoop data scientist

1