
I have data that doesn't fit in memory, so I was reading the following links:

https://stackoverflow.com/a/32582416/9759150

https://stackoverflow.com/a/29518749/9759150

Both of those relate to this: https://spark.apache.org/faq.html

According to what I read, Spark writes to disk if the data doesn't fit in memory. But I want to avoid writing to disk. So I want to know whether I can determine how many times I need to iterate over the data so that it is processed only in memory. Can I do this? How?

diens
  • What types of operations are you looking to do? Many tasks will require iterating over the entire dataset. Why, might I ask, are you looking to avoid writing to disk? – alta Jun 29 '18 at 16:35
  • https://stackoverflow.com/questions/41661849/spill-to-disk-and-shuffle-write-spark – vaquar khan Jun 29 '18 at 21:16

1 Answer


It is pretty difficult to deterministically find the exact number of times you need to iterate over the dataset.

After you read the data from disk and cache it, Spark materializes the dataset and represents it in memory using the Tungsten format.
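
For example, a minimal sketch in Scala (assuming a SparkSession named `spark` and a hypothetical Parquet path) that caches the data in memory only and forces it to materialize:

```scala
import org.apache.spark.storage.StorageLevel

// Hypothetical path; MEMORY_ONLY keeps cached partitions in RAM only,
// so partitions that don't fit are recomputed rather than spilled to disk.
val df = spark.read.parquet("hdfs:///data/events.parquet")
df.persist(StorageLevel.MEMORY_ONLY)

// An action is needed to actually materialize and cache the dataset.
df.count()
```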

How big the dataset is in memory depends on the data types of its columns. Also, because the cached data is deserialized, it takes more memory than the serialized data on disk.

In my experience, it generally takes 3-4x the on-disk size to fit Parquet data in memory. So if you have 50 GB of Parquet data in HDFS, you probably need around 200 GB of memory in the cluster to cache the complete dataset.
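
If you want to check the actual figure rather than rely on the 3-4x rule of thumb, the Storage tab of the Spark UI shows the cached size. The sketch below (assuming the `df` cached above and that nothing else is cached) reads the same numbers programmatically through the developer API `getRDDStorageInfo`:

```scala
// Sum the in-memory size of all cached RDDs; the cached DataFrame
// appears here as a cached RDD backing the in-memory relation.
val cachedBytes = spark.sparkContext.getRDDStorageInfo.map(_.memSize).sum
println(f"cached in memory: ${cachedBytes / math.pow(1024, 3)}%.2f GB")
```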

You need to do some trial and error before arriving at the right number here.
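
One way to make that trial and error cheaper (just a sketch with an assumed path and sample fraction, not something prescribed by the answer) is to cache a small sample, measure it, and extrapolate to the full row count:

```scala
import org.apache.spark.storage.StorageLevel

val full   = spark.read.parquet("hdfs:///data/events.parquet") // assumed path
val sample = full.sample(withReplacement = false, fraction = 0.01)

sample.persist(StorageLevel.MEMORY_ONLY)
val sampleRows = sample.count() // materializes the sample in the cache

// Assumes nothing else is cached, so the total cached size is the sample's.
val sampleBytes = spark.sparkContext.getRDDStorageInfo.map(_.memSize).sum

val totalRows      = full.count()
val estimatedBytes = sampleBytes.toDouble / sampleRows * totalRows
println(f"estimated in-memory size of the full dataset: ${estimatedBytes / math.pow(1024, 3)}%.2f GB")
```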

Avishek Bhattacharya