
I use Spark (in Java) to create an RDD of complex objects. Is it possible to save these objects permanently in memory so I can reuse them with Spark in the future?

(Because Spark cleans up memory after an application or job finishes.)


1 Answer


Spark is not intended as permanent storage; you can use HDFS, Elasticsearch, or another 'Spark-compatible' cluster storage for that.

Spark reads data from cluster storage, does some work in random-access memory (RAM), optionally caching intermediate results, and then usually writes the results back to cluster storage, because there may be too much output for a local hard drive.

Example: Read from HDFS -> Spark ... RDD ... -> Store results in HDFS
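A minimal sketch of that pipeline in Java; the namenode address and paths are placeholders, adjust them for your cluster:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class HdfsRoundTrip {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("HdfsRoundTrip");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // Read from cluster storage.
            JavaRDD<String> lines = sc.textFile("hdfs://namenode:8020/input/data.txt");

            // Do the work in RAM.
            JavaRDD<String> upper = lines.map(String::toUpperCase);

            // Write the results back to cluster storage.
            upper.saveAsTextFile("hdfs://namenode:8020/output/result");

            sc.stop();
        }
    }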

You must distinguish between slow persistent storage such as hard drives (spinning disk, SSD) and fast volatile memory such as RAM. The strength of Spark is that it makes heavy use of RAM.

You may use caching for temporary storage, see: (Why) do we need to call cache or persist on a RDD
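As a rough sketch of caching in Java (the path is a placeholder), keeping in mind that the cached data lives only as long as the application:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.storage.StorageLevel;

    SparkConf conf = new SparkConf().setAppName("CacheDemo");
    JavaSparkContext sc = new JavaSparkContext(conf);

    JavaRDD<String> lines = sc.textFile("hdfs://namenode:8020/input/data.txt");
    JavaRDD<String> errors = lines.filter(l -> l.contains("ERROR"));

    // Keep the filtered RDD in RAM for the lifetime of this application only.
    errors.persist(StorageLevel.MEMORY_ONLY()); // equivalent to errors.cache()

    long total = errors.count();              // first action computes and caches
    long unique = errors.distinct().count();  // later actions reuse the cache

    sc.stop();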

  • I understand, but databases are not well adapted to storing objects... So would the best solution be to use HDFS to keep the data on disk, plus an in-memory database like Tachyon or Redis for speed when Spark reads the data, even though the object format is not kept? – TiGi Jul 07 '16 at 08:47
  • HDFS works well with Spark; often you do HDFS -> Spark -> HDFS. The point is that you must use something compatible with Spark that can hold large amounts of data, though maybe your Spark output is not as big as the input, so this is not always a requirement. – Christophe Roussy Jul 07 '16 at 09:01
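Regarding keeping the object format from the comments above: one option, not discussed in this thread and only a sketch, is Spark's saveAsObjectFile / objectFile pair, which stores Java-serialized objects on HDFS and reads them back in a later application. MyObject and the paths here are placeholders, and the class must implement java.io.Serializable:

    import java.io.Serializable;
    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    // Placeholder complex object; must be serializable.
    public class MyObject implements Serializable {
        public int id;
        public String name;
    }

    SparkConf conf = new SparkConf().setAppName("ObjectFileDemo");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // First application: write the RDD of objects to HDFS.
    JavaRDD<MyObject> rdd = sc.parallelize(Arrays.asList(new MyObject(), new MyObject()));
    rdd.saveAsObjectFile("hdfs://namenode:8020/saved/my-objects");

    // A later application: read the serialized objects back into an RDD.
    JavaRDD<MyObject> restored = sc.objectFile("hdfs://namenode:8020/saved/my-objects");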