
I have a Java application that acts as the driver application for Spark. It does some data processing and streams a subset of the data to memory.

Sample Code:

    // Stream the processed subset into an in-memory table named "orderdataDS"
    StreamingQuery query = ds.writeStream()
        .format("memory")
        .queryName("orderdataDS")
        .start();

Now I need another Python application to access this dataset (orderdataDS).

How can this be accomplished?

2 Answers


You cannot, unless both applications share the same JVM driver process (as in Zeppelin). If you want data to be shared between multiple applications, use an independent store, such as an RDBMS.

Overall, the memory sink is not intended for production use:

This should be used for debugging purposes on low data volumes as the entire output is collected and stored in the driver’s memory
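
One way to follow the advice above, sketched in Java: replace the memory sink with a foreachBatch writer that appends each micro-batch to a shared JDBC table, which any other application (including a PySpark one) can then read. This is only a sketch; it assumes Spark 2.4+ (where foreachBatch is available), that `ds` is the streaming Dataset<Row> from the question, and hypothetical PostgreSQL connection details:

    import org.apache.spark.api.java.function.VoidFunction2;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.streaming.StreamingQuery;

    // `ds` is assumed to be the streaming Dataset<Row> from the question
    StreamingQuery query = ds.writeStream()
        .foreachBatch((VoidFunction2<Dataset<Row>, Long>) (batchDf, batchId) -> {
            // Append each micro-batch to a table both applications can reach
            batchDf.write()
                .format("jdbc")
                .option("url", "jdbc:postgresql://dbhost:5432/sharedstore") // hypothetical
                .option("dbtable", "orderdata")                             // hypothetical
                .option("user", "spark")
                .option("password", "secret")
                .mode(SaveMode.Append)
                .save();
        })
        .option("checkpointLocation", "/tmp/checkpoints/orderdata")
        .start();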

Alper t. Turker
  • Global Temporary View: Temporary views in Spark SQL are session-scoped and will disappear if the session that creates it terminates. If you want to have a temporary view that is shared among all sessions and keep alive until the Spark application terminates, you can create a global temporary view. Global temporary view is tied to a system preserved database global_temp, and we must use the qualified name to refer it, e.g. SELECT * FROM global_temp.view1. – Jagannadh V Sep 07 '17 at 06:08
  • If the data that I have persisted is in global temp view and the driver that has created that data is still alive, can I query it using syntax like SELECT * FROM global_temp.viewname? – Jagannadh V Sep 07 '17 at 06:09
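
The two comments above quote the Spark documentation on global temporary views. A minimal Java sketch of that approach (with a hypothetical batch source): the view lives in the system-preserved global_temp database and is visible to every session of the same Spark application, but it still does not make the data reachable from a separate Python process.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    SparkSession spark = SparkSession.builder().appName("orders").getOrCreate();

    // Hypothetical batch Dataset standing in for the processed order data
    Dataset<Row> orders = spark.read().json("/data/orders.json");

    // Register a global temporary view; it is tied to the global_temp database
    // and stays alive until this Spark application terminates
    orders.createOrReplaceGlobalTempView("orderdataDS");

    // Another session created from the same application can query it
    SparkSession other = spark.newSession();
    Dataset<Row> shared = other.sql("SELECT * FROM global_temp.orderdataDS");
    shared.show();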

To build upon the above answer, Spark was not built with concurrency in mind. As the answer above suggests, you need to back Spark with a "state store" such as an RDBMS. There are a large number of options when you go to do this; I've detailed the majority of them here
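
To complement the writer sketch under the first answer, a second application would read the shared table back through the standard JDBC source. The connection details below are hypothetical, and a PySpark application would issue the equivalent spark.read.format("jdbc") call:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    SparkSession spark = SparkSession.builder().appName("orderReader").getOrCreate();

    // Read whatever the streaming writer has appended so far (hypothetical details)
    Dataset<Row> orderdata = spark.read()
        .format("jdbc")
        .option("url", "jdbc:postgresql://dbhost:5432/sharedstore")
        .option("dbtable", "orderdata")
        .option("user", "spark")
        .option("password", "secret")
        .load();

    orderdata.show();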

plamb
  • _"Jobs are scheduled and happen one at a time."_ that's certainly not correct. See http://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application – Jacek Laskowski Sep 06 '17 at 23:13
  • Sorry, that was an oversimplification. Strictly speaking, any single application can run multiple jobs, but two concurrent users cannot run their own program simultaneously on the same cluster. If user 1 caches the result of his job, user 2 cannot run another program and get access to the cached result. User 2 will require an independent set of resources (executors) and have no visibility into the cache. – plamb Sep 07 '17 at 16:36
  • _"but two concurrent users cannot run their own program simultaneously on the same cluster."_ I thought that's the goal of **any** Spark cluster, e.g. Mesos, YARN or even Spark Standalone. The rest is correct. – Jacek Laskowski Sep 08 '17 at 06:24