As far as I know, Spark tries to do all computation in memory unless you call persist with a disk storage option. If we don't use persist, what does Spark do when an RDD doesn't fit in memory? What if we have very large data? How will Spark handle it without crashing?
- You need to understand the difference between the full data set (which could be many times the amount of RAM in your cluster) and the size of your biggest [RDD partition](http://spark.apache.org/docs/latest/programming-guide.html#resilient-distributed-datasets-rdds). An RDD partition is the unit of work for a node and is the thing that needs to fit in memory. Conceivably you could process a petabyte dataset on a single laptop, if you break it into enough partitions (it might take a long time...). There are functions in Spark to `repartition` the RDD so that each partition is small enough. – Alister Lee Sep 16 '15 at 03:38
- Also, even though it does a lot of work in memory, it will write the partitions to disk at each shuffle. – Alister Lee Sep 16 '15 at 03:45
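A minimal Scala sketch of the point made in Alister Lee's comment: it is the individual partition, not the whole RDD, that has to fit in executor memory, and `repartition` can be used to break the data into smaller pieces. The input path and the partition count (1000) are placeholders, not values from this question:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RepartitionSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("repartition-sketch"))

    // The unit that must fit in memory on an executor is a single partition,
    // not the entire RDD.
    val lines = sc.textFile("hdfs:///data/huge-dataset.txt")
    println(s"Initial partitions: ${lines.partitions.length}")

    // Split the data into more (and therefore smaller) partitions before a
    // memory-heavy stage.
    val repartitioned = lines.repartition(1000)
    println(s"After repartition: ${repartitioned.partitions.length}")

    sc.stop()
  }
}
```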
1 Answer
From the Apache Spark FAQ:
> Spark's operators spill data to disk if it does not fit in memory, allowing it to run well on any sized data. Likewise, cached datasets that do not fit in memory are either spilled to disk or recomputed on the fly when needed, as determined by the RDD's storage level.
Refer to the link below to learn more about storage levels and how to choose the appropriate one: programming-guide.html
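A minimal Scala sketch of choosing a storage level explicitly; the file path and the transformation are placeholders. With `MEMORY_AND_DISK`, partitions that do not fit in memory are spilled to local disk rather than recomputed from lineage:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object StorageLevelSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("storage-level-sketch"))

    // A transformation expensive enough to be worth caching.
    val parsed = sc.textFile("hdfs:///data/huge-dataset.txt").map(_.split(","))

    // Without persist, partitions that don't fit in memory are recomputed from
    // lineage when needed. With MEMORY_AND_DISK, partitions that don't fit are
    // written to local disk and read back instead of being recomputed.
    parsed.persist(StorageLevel.MEMORY_AND_DISK)

    println(parsed.count())                      // first action materializes and caches the RDD
    println(parsed.filter(_.length > 1).count()) // later actions reuse the cached/spilled partitions

    sc.stop()
  }
}
```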


– Sachin Gaikwad
- This recompute-on-the-fly feature is pretty annoying. It's recomputing things even if I use persist with the MEMORY_AND_DISK storage level. It seems like it does not do what it says. – MetallicPriest Sep 15 '15 at 10:04