
I would like to store the results of continuous queries running against streaming data in such a way that the results are persisted across distributed nodes, to ensure failover and scalability.

Can Spark SQL experts please shed some light on (1) which storage option I should choose so that OLAP queries are faster, (2) how to ensure the data is available for querying even if one node is down, and (3) how Spark SQL stores the result set internally?

Thanks, Kaniska

kaniska Mandal

1 Answer


It depends on what kind of latency you can afford.

  • One way is to persist the result into HDFS/Cassandra, or to keep it on the executors with the persist() API. If your data is small, then cache() on each RDD should give you a good result (see the first sketch after this list).

  • Store the data where your Spark executors are co-located. For example:

    • It is also possible to use memory-based storage like Tachyon to persist your stream (i.e. each RDD of your stream) and query against it.
    • If latency is not an issue, then persist(MEMORY_AND_DISK_2) should give you what you need. Mind you, performance is hit-or-miss in that scenario. Also, this replicates the data across two executors.
  • In other cases, if your clients are more comfortable with an OLTP-like database where they just need to query the constantly updated result, you can use a conventional database such as Postgres or MySQL (see the second sketch after this list). This is a preferred method for many, as query time is consistent and predictable. If the result is not update-heavy but is partitioned (say, by time), then Greenplum-like systems are also an option.
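
A minimal sketch of the first two ideas (Spark Streaming / DStreams, Scala): replicating each RDD of the stream across two executors with persist(), and writing each micro-batch to HDFS so queries read from a replicated file system. The socket source, checkpoint directory and output paths are placeholders, not anything from the question.

    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StreamPersistSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("stream-persist-sketch")
        val ssc  = new StreamingContext(conf, Seconds(10))
        ssc.checkpoint("hdfs:///checkpoints/stream-persist")   // placeholder path

        // Placeholder source; in practice this would be Kafka, Kinesis, etc.
        val lines = ssc.socketTextStream("localhost", 9999)

        val counts = lines
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1L))
          .reduceByKey(_ + _)

        // Option A: keep each RDD of the stream in memory/disk replicated on
        // two executors, so losing a single node does not lose the data.
        counts.persist(StorageLevel.MEMORY_AND_DISK_2)

        // Option B: write each micro-batch out to HDFS so downstream OLAP
        // queries hit a replicated file system instead of executor memory.
        counts.foreachRDD { (rdd, time) =>
          if (!rdd.isEmpty()) {
            rdd.saveAsTextFile(s"hdfs:///results/counts-${time.milliseconds}")
          }
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }

The HDFS path here uses plain text files for brevity; converting each RDD to a DataFrame and writing Parquet would fit the asker's setup equally well.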
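
And a minimal sketch of the conventional-database option (Structured Streaming with foreachBatch, which needs Spark 2.4+). The rate source, connection settings, credentials and table name are placeholders; the Postgres JDBC driver must be on the classpath.

    import org.apache.spark.sql.{DataFrame, SparkSession}

    object JdbcSinkSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("jdbc-sink-sketch").getOrCreate()

        // Placeholder streaming source and aggregation.
        val counts = spark.readStream
          .format("rate")                      // built-in test source
          .option("rowsPerSecond", "10")
          .load()
          .groupBy("value")
          .count()

        val jdbcProps = new java.util.Properties()
        jdbcProps.setProperty("user", "app")               // placeholder credentials
        jdbcProps.setProperty("password", "secret")
        jdbcProps.setProperty("driver", "org.postgresql.Driver")

        val query = counts.writeStream
          .outputMode("complete")              // aggregation result rewritten every batch
          .foreachBatch { (batch: DataFrame, batchId: Long) =>
            // Overwrite keeps a single, constantly updated result table that
            // clients can query with consistent, predictable latency.
            batch.write
              .mode("overwrite")
              .jdbc("jdbc:postgresql://db-host:5432/analytics", "stream_counts", jdbcProps)
          }
          .start()

        query.awaitTermination()
      }
    }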
Manas
  • To be more specific, if I use **writeStream.format("parquet").option("path", "path/to/destination/dir")**, then will the writes to the Parquet files be replicated to **all the nodes in the cluster** automatically? I need both **fast queries on the output sink and failover/replication**. – kaniska Mandal Jun 17 '17 at 18:14
  • No, it is not necessarily replicated to ALL NODES just because you write to `path/to/destination/dir`. It all depends on how many partitions you have. To take things into your own hands and make sure all executors get some of the data, you can repartition the data with your own partitioning logic, which will make sure all machines hold part of the data (as in the sketch below). – Manas Jun 18 '17 at 17:51
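
A minimal sketch of repartitioning before the Parquet sink (Structured Streaming, Scala). The Kafka source options, the `key` column used for partitioning, and the checkpoint path are placeholders; `path/to/destination/dir` comes from the comment above and would normally live on a replicated file system such as HDFS.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    object RepartitionedParquetSink {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("repartitioned-sink").getOrCreate()

        // Placeholder streaming source (needs the spark-sql-kafka package).
        val events = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .load()

        // Spread each micro-batch across executors; here by a hypothetical
        // "key" column, but any partitioning expression works.
        val repartitioned = events
          .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
          .repartition(8, col("key"))

        val query = repartitioned.writeStream
          .format("parquet")
          .option("path", "path/to/destination/dir")
          .option("checkpointLocation", "path/to/checkpoint/dir")
          .start()

        query.awaitTermination()
      }
    }

Note that repartitioning spreads the write work and the resulting files across executors; durable replication of the files themselves still comes from the underlying file system (e.g. HDFS replication), not from Spark.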