DataFrames in Apache Spark use off-heap memory for storing data. What is the main purpose of using off-heap memory? My current understanding is that it is beneficial for storing large objects (mutable or immutable objects?) so that we don't need a larger Java heap; a large Java heap slows the application down because of how the Java garbage collector works.

That is what I've understood so far. Can someone please help me put the pieces together?
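For reference, here is a minimal sketch of how Tungsten's explicit off-heap mode is switched on (the two `spark.memory.offHeap.*` keys are standard Spark settings; the app name and size are illustrative assumptions, not recommendations):

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: enable Tungsten's explicit off-heap memory pool.
val spark = SparkSession.builder()
  .appName("offheap-demo")   // hypothetical app name
  .master("local[*]")
  .config("spark.memory.offHeap.enabled", "true")
  .config("spark.memory.offHeap.size", "2g") // required when off-heap is enabled
  .getOrCreate()

// DataFrame execution/storage memory can now come from the off-heap
// pool, which keeps those buffers out of the garbage collector's view.
val df = spark.range(1000000L).toDF("id")
df.cache().count()
```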

Hemanth Gowda
  • That is basically it. Off heap can be much larger but is harder to work with unless you have a library which hides the details of using it. (And I have written a few ;) – Peter Lawrey Apr 29 '18 at 23:32
  • Thanks for confirming! I get the general idea of it, but I'm trying to understand what exactly Project Tungsten is. Is it only a better way of doing serialization, or does it have more advantages? Can DataFrames work directly on serialized data without deserializing it? Why can't we achieve the same with RDDs? (See the sketch after these comments.) – Hemanth Gowda Apr 30 '18 at 01:41
  • The first few links seem to cover this: https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html https://databricks.com/session/deep-dive-into-project-tungsten-bringing-spark-closer-to-bare-metal https://www.slideshare.net/mobile/databricks/project-tungsten-phase-ii-joining-a-billion-rows-per-second-on-a-laptop – Peter Lawrey Apr 30 '18 at 06:02
  • That's great detail! Really helps. Thank you so much for your time :) – Hemanth Gowda Apr 30 '18 at 07:45
  • [DeepDiveIntoSparkStorageFormats](https://spoddutur.github.io/spark-notes/deep_dive_into_storage_formats.html) This gave me a perfect picture! – Hemanth Gowda May 01 '18 at 23:48
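To make the comment thread concrete, here is a hedged sketch of the difference (`Trade` is a hypothetical case class; everything else uses standard Spark APIs): an RDD holds deserialized JVM objects on the heap unless serialized caching is requested explicitly, while a Dataset keeps rows in Tungsten's compact binary format, so column expressions can be evaluated against the serialized bytes without materializing objects first.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.storage.StorageLevel

// Hypothetical record type, for illustration only.
case class Trade(symbol: String, qty: Int, price: Double)

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// RDD: caches deserialized Java objects by default; even with
// MEMORY_ONLY_SER, each record is deserialized before user code sees it.
val rdd = spark.sparkContext
  .parallelize(Seq(Trade("AAPL", 100, 170.0), Trade("MSFT", 50, 300.0)))
rdd.persist(StorageLevel.MEMORY_ONLY_SER)

// Dataset: rows are encoded into Tungsten's binary layout, and column
// expressions (filter, sort, aggregate) run against that layout
// without first rebuilding Trade objects.
val ds: Dataset[Trade] =
  Seq(Trade("AAPL", 100, 170.0), Trade("MSFT", 50, 300.0)).toDS()
ds.filter($"qty" > 75).show()
```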

0 Answers