We have a use case in Spark where we want to load historical data from our database into Spark and then keep appending new streaming data, so that we can run analyses over the whole up-to-date dataset.
As far as I know, neither Spark SQL nor Spark Streaming can combine historical data with streaming data. Then I found Structured Streaming in Spark 2.0, which seems to be built for exactly this problem. But after some experimenting, I still cannot figure it out. Here is my code:
import java.util.Iterator;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder;
import org.apache.spark.sql.streaming.StreamingQuery;

import org.bson.Document;
import com.mongodb.spark.MongoSpark;
import com.mongodb.spark.rdd.api.java.JavaMongoRDD;

SparkSession spark = SparkSession
    .builder()
    .config(conf)
    .getOrCreate();
JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

// Load historical data from MongoDB
JavaMongoRDD<Document> mongordd = MongoSpark.load(jsc);

// Create a typed dataset with the customized schema
JavaRDD<JavaRecordForSingleTick> rdd =
    mongordd.flatMap(new FlatMapFunction<Document, JavaRecordForSingleTick>() {...});
Dataset<Row> df = spark.sqlContext().createDataFrame(rdd, JavaRecordForSingleTick.class);
Dataset<JavaRecordForSingleTick> df1 =
    df.as(ExpressionEncoder.javaBean(JavaRecordForSingleTick.class));

// ds listens to a streaming data source
Dataset<Row> ds = spark.readStream()
    .format("socket")
    .option("host", "127.0.0.1")
    .option("port", 11111)
    .load();

// Create the typed dataset with the customized schema
Dataset<JavaRecordForSingleTick> ds1 = ds
    .as(Encoders.STRING())
    .flatMap(new FlatMapFunction<String, JavaRecordForSingleTick>() {
        @Override
        public Iterator<JavaRecordForSingleTick> call(String str) throws Exception {
            ...
        }
    }, ExpressionEncoder.javaBean(JavaRecordForSingleTick.class));

// ds1 and df1 have the same schema; ds1 gets data from the streaming
// data source, df1 is the dataset holding the historical data
ds1 = ds1.union(df1);

StreamingQuery query = ds1.writeStream().format("console").start();
query.awaitTermination();
When I union() the two datasets, I get the error "org.apache.spark.sql.AnalysisException: Union between streaming and batch DataFrames/Datasets is not supported;".
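For reference, the restriction does not seem specific to MongoDB or to the socket source. As far as I can tell, even a minimal sketch like the following (the local master, app name, and sample data are just illustrative; it reuses the imports above) hits the same exception:

SparkSession spark = SparkSession.builder()
    .master("local[*]")
    .appName("union-repro")
    .getOrCreate();

// Batch side: a small in-memory Dataset<String>.
Dataset<String> batch = spark.createDataset(
    java.util.Arrays.asList("a", "b"), Encoders.STRING());

// Streaming side: the same socket source as above, typed to String.
Dataset<String> stream = spark.readStream()
    .format("socket")
    .option("host", "127.0.0.1")
    .option("port", 11111)
    .load()
    .as(Encoders.STRING());

// Combining the two and starting a query fails with the
// AnalysisException quoted above.
Dataset<String> combined = stream.union(batch);
StreamingQuery query = combined.writeStream().format("console").start();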
Could anyone please help me? Am I going in the wrong direction?