I'm quite new to Spark and am trying to understand its functionality. I come from a database background and am confused by Spark databases and tables. Does Spark also store data permanently on its own and make it available at all times, as an RDBMS or NoSQL store does? Or does it just create a reference to the incoming data for the duration of processing, with the data gone once the process is over? Basically, how is Spark used when we have to process data regularly in batches or as a continuous stream? What is the time to live for data in Spark tables?

OneCricketeer
Sandie

1 Answer

Spark is not a database, and it does not store data permanently by itself. It's a cluster computing framework/engine that can also run in a standalone environment. What Spark does is pull data from various sources (HDFS, S3, the local filesystem, an RDBMS, a NoSQL store, etc.) and perform analysis or transformations in the memory (RAM) of its worker nodes. It can spill data to local disk if the data does not fit in RAM. Once an action has finished, the data is flushed out. You can cache or persist the data, in which case it remains available for as long as the Spark context is running. Even then, if the memory fills up, Spark picks the least recently used (LRU) RDD and evicts it to make room for other RDDs. Memory management is an interesting concept in Spark.
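To make the LRU behaviour concrete, here is a small toy model in plain Python. It is only a sketch of the idea, not Spark's actual code: the class `ToyBlockCache`, its methods, and the names `rdd_a`, `rdd_b`, `rdd_c` are all made up for illustration, and capacity is counted in number of entries rather than bytes of memory.

```python
from collections import OrderedDict


class ToyBlockCache:
    """Toy LRU cache (hypothetical, NOT Spark's implementation).

    Capacity is a simple entry count; a real engine tracks memory size.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()  # oldest (least recently used) entry first
        self.evicted = []            # names dropped to make room, in order

    def put(self, name, data):
        # Storing an entry also counts as using it.
        if name in self.blocks:
            self.blocks.move_to_end(name)
        self.blocks[name] = data
        # Over capacity: drop the least recently used entries.
        while len(self.blocks) > self.capacity:
            old_name, _ = self.blocks.popitem(last=False)
            self.evicted.append(old_name)

    def get(self, name, recompute):
        # Cache hit: mark as most recently used and return it.
        if name in self.blocks:
            self.blocks.move_to_end(name)
            return self.blocks[name]
        # Cache miss: rebuild the data from its source and cache it again.
        data = recompute()
        self.put(name, data)
        return data


cache = ToyBlockCache(capacity=2)
cache.put("rdd_a", [1, 2, 3])
cache.put("rdd_b", [4, 5, 6])
cache.get("rdd_a", lambda: [1, 2, 3])  # touch rdd_a; rdd_b is now the LRU entry
cache.put("rdd_c", [7, 8, 9])          # cache is full, so rdd_b is evicted

print(cache.evicted)       # ['rdd_b']
print(list(cache.blocks))  # ['rdd_a', 'rdd_c']
```

The `recompute` callback stands in for the point made above: an evicted entry is not lost for good, because it can be rebuilt from the original source the next time it is needed.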

stevel
Chandan Ray