As far as I know, Spark tries to do all computation in memory unless you call persist with a disk storage option. If we don't use persist, what does Spark do when an RDD doesn't fit in memory? What if we have very large data? How will Spark handle it without crashing?
- You need to understand the difference between the full data set (which could be many times the amount of RAM in your cluster) and the size of your biggest [RDD partition](http://spark.apache.org/docs/latest/programming-guide.html#resilient-distributed-datasets-rdds). An RDD partition is the unit of work for a node and is the thing that needs to fit in memory. Conceivably you could process a petabyte dataset on a single laptop, if you break it into enough partitions (it might take a long time...). There are functions in Spark to `repartition` the RDD so that each partition is small enough. – Alister Lee Sep 16 '15 at 03:38
- Also, even though it does a lot of work in memory, it will write the partitions to disk at each shuffle. – Alister Lee Sep 16 '15 at 03:45
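A minimal Scala sketch of the point made in Alister Lee's comment: it is the individual partition, not the whole RDD, that has to fit in executor memory, and `repartition` can be used to break the data into smaller pieces. The input path and the partition count (1000) are placeholders, not values from this question:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RepartitionSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("repartition-sketch"))

    // The unit that must fit in memory on an executor is a single partition,
    // not the entire RDD.
    val lines = sc.textFile("hdfs:///data/huge-dataset.txt")
    println(s"Initial partitions: ${lines.partitions.length}")

    // Split the data into more (and therefore smaller) partitions before a
    // memory-heavy stage.
    val repartitioned = lines.repartition(1000)
    println(s"After repartition: ${repartitioned.partitions.length}")

    sc.stop()
  }
}
```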
1 Answer
From the Apache Spark FAQ:
> Spark's operators spill data to disk if it does not fit in memory, allowing it to run well on any sized data. Likewise, cached datasets that do not fit in memory are either spilled to disk or recomputed on the fly when needed, as determined by the RDD's storage level.
Refer to the link below to learn more about storage levels and how to choose the appropriate one: programming-guide.html
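A minimal Scala sketch of choosing a storage level explicitly; the file path and the transformation are placeholders. With `MEMORY_AND_DISK`, partitions that do not fit in memory are spilled to local disk rather than recomputed from lineage:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object StorageLevelSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("storage-level-sketch"))

    // A transformation expensive enough to be worth caching.
    val parsed = sc.textFile("hdfs:///data/huge-dataset.txt").map(_.split(","))

    // Without persist, partitions that don't fit in memory are recomputed from
    // lineage when needed. With MEMORY_AND_DISK, partitions that don't fit are
    // written to local disk and read back instead of being recomputed.
    parsed.persist(StorageLevel.MEMORY_AND_DISK)

    println(parsed.count())                      // first action materializes and caches the RDD
    println(parsed.filter(_.length > 1).count()) // later actions reuse the cached/spilled partitions

    sc.stop()
  }
}
```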


– Sachin Gaikwad
- This recompute-on-the-fly feature is pretty annoying. It's recomputing things even if I use persist with the MEMORY_AND_DISK storage level. It seems like it does not do what it says. – MetallicPriest Sep 15 '15 at 10:04