
I have recently been working with Spark and have come across a few questions that I still can't resolve.

Let's say I have a dataset of 100 GB and the total RAM of my cluster is 16 GB.

Now, I know that simply reading the file and saving it to HDFS will work, since Spark will do it partition by partition. But what happens when I perform a sorting or aggregation transformation on the 100 GB of data? How will it process 100 GB in memory, given that sorting seems to need the entire dataset?
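For concreteness, the kind of job I have in mind is roughly the following (the HDFS paths and the key extraction are made up just to illustrate the question):

```scala
import org.apache.spark.sql.SparkSession

object SortAggregate100Gb {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("sort-and-aggregate-100gb").getOrCreate()
    val sc = spark.sparkContext

    // ~100 GB of text input, far larger than the 16 GB of cluster RAM
    val pairs = sc.textFile("hdfs:///data/events")           // hypothetical path
      .map(line => (line.split(",")(0), 1L))                 // (key, 1) pairs

    // Aggregation: how is this computed when all keys can't be held in memory at once?
    pairs.reduceByKey(_ + _).saveAsTextFile("hdfs:///out/counts")

    // Sorting: how can a global sort work without the whole dataset in RAM?
    pairs.sortByKey().saveAsTextFile("hdfs:///out/sorted")

    spark.stop()
  }
}
```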

I have gone through the link below, but it only explains what Spark does when persisting. What I am looking for is how Spark handles aggregations or sorting on a dataset larger than the RAM size.

Spark RDD - is partition(s) always in RAM?

Any help is appreciated.

  • Spark spills content to disk when memory is used up (well, you can change default config...). You can see this info in the Storage tab of the UI. – ernest_k Apr 17 '18 at 20:56

2 Answers


There are 2 things you might want to know.

  1. Once Spark reaches the memory limit, it will start spilling data to disk. Please check the Spark FAQ; there are also several questions on SO about the same topic, for example, this one.
  2. There is an algorithm called external sort that allows you to sort datasets which do not fit in memory. Essentially, you divide the large dataset into chunks which actually fit in memory, sort each chunk, and write each chunk to disk. Finally, you merge every sorted chunk to get the whole dataset sorted. Spark supports external sorting, as you can see here, and here is the implementation.
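To make the idea concrete, here is a rough, self-contained sketch of external sort in plain Scala (the chunk size, the one-integer-per-line format, and the temp-file handling are simplifications for illustration; Spark's own ExternalSorter is considerably more involved):

```scala
import java.io.{File, PrintWriter}
import scala.collection.mutable
import scala.io.Source

object ExternalSortSketch {

  // Sort a file containing one integer per line that does not fit in memory.
  def externalSort(input: File, output: File, chunkSize: Int = 1000000): Unit = {
    // 1. Read memory-sized chunks, sort each one, and spill it to its own temp file.
    val chunkFiles = Source.fromFile(input).getLines().map(_.toInt)
      .grouped(chunkSize)
      .map { chunk =>
        val tmp = File.createTempFile("chunk", ".txt")
        val w = new PrintWriter(tmp)
        chunk.sorted.foreach(v => w.println(v))
        w.close()
        tmp
      }
      .toVector

    // 2. k-way merge of the sorted chunks using a min-heap of (head value, chunk index).
    val readers = chunkFiles.map(f => Source.fromFile(f).getLines().map(_.toInt))
    val heap = mutable.PriorityQueue.empty[(Int, Int)](Ordering.by[(Int, Int), Int](_._1).reverse)
    readers.zipWithIndex.foreach { case (it, i) => if (it.hasNext) heap.enqueue((it.next(), i)) }

    val w = new PrintWriter(output)
    while (heap.nonEmpty) {
      val (value, i) = heap.dequeue()
      w.println(value)
      if (readers(i).hasNext) heap.enqueue((readers(i).next(), i))
    }
    w.close()
    chunkFiles.foreach(_.delete())
  }
}
```

Spark's shuffle does essentially this per task: sorted spill files on disk that are merged back together when the data is read.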

To answer your question: your data does not really need to fit in memory in order to be sorted, as explained above. Now, I would encourage you to think about an algorithm for data aggregation that divides the data into chunks, just like external sort does.
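The same chunking idea works for aggregation: build a partial aggregate per chunk and then merge the partial results, which is essentially what a map-side combiner does. A minimal word-count sketch (the chunk size is arbitrary and for illustration only):

```scala
object ChunkedAggregationSketch {

  // Count occurrences of each key without materializing all records at once;
  // only one chunk and the map of distinct keys need to fit in memory.
  def countByKey(records: Iterator[String], chunkSize: Int = 1000000): Map[String, Long] =
    records
      .grouped(chunkSize)
      .map(chunk => chunk.groupBy(identity).map { case (k, vs) => k -> vs.size.toLong })
      .foldLeft(Map.empty[String, Long]) { (acc, partial) =>
        partial.foldLeft(acc) { case (m, (k, v)) => m.updated(k, m.getOrElse(k, 0L) + v) }
      }
}
```

If even the distinct keys do not fit in memory, the partial maps themselves can be spilled and merged from disk, which is what Spark's ExternalAppendOnlyMap does for aggregations.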

  • I will look into external sorting in detail sometime, but before that: let's say the dataset doesn't fit into memory. Spark will spill it to disk, but how will it keep track of the {key, value} pairs for that partition, which are further needed for aggregation or sorting? – salmanbw Apr 18 '18 at 07:13
  • @salmanbw You do not need to track anything. For aggregations, Spark will use `combineByKey()` behind the scenes, which actually uses the **combiner**, a feature from MapReduce. Please check this question: https://stackoverflow.com/questions/24804619/how-does-spark-aggregate-function-aggregatebykey-work. – dbustosp Apr 18 '18 at 21:36
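For reference, a minimal `combineByKey()` example (a per-key average; `sc` is an existing SparkContext and the data is made up):

```scala
// Per-key average: partial (sum, count) combiners are built on the map side,
// so only small per-key state has to cross the shuffle or stay in memory.
val scores = sc.parallelize(Seq(("a", 1.0), ("b", 4.0), ("a", 3.0)))

val avgByKey = scores.combineByKey(
    (v: Double) => (v, 1L),                                              // createCombiner
    (acc: (Double, Long), v: Double) => (acc._1 + v, acc._2 + 1L),       // mergeValue
    (a: (Double, Long), b: (Double, Long)) => (a._1 + b._1, a._2 + b._2) // mergeCombiners
  ).mapValues { case (sum, n) => sum / n }
```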

There are multiple things you need to consider. Because you have 16 GB of RAM and a 100 GB dataset, it is a good idea to persist to DISK. Aggregating may still be difficult if the dataset has high cardinality. If the cardinality is low, you will be better off aggregating within each partition before merging into the whole dataset. Also remember to make sure that each partition of the RDD fits in the memory available to a task (by default about 0.4 * container_size).
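A hedged sketch of what that could look like (the path, partition count, and storage level are illustrative; tune them to your cluster):

```scala
import org.apache.spark.storage.StorageLevel

// Keep the 100 GB dataset on disk instead of forcing it into 16 GB of RAM,
// and size partitions so each one comfortably fits in an executor's memory.
val events = sc.textFile("hdfs:///data/events")        // hypothetical path
  .map(line => (line.split(",")(0), 1L))
  .repartition(800)                                     // ~128 MB per partition for 100 GB
  .persist(StorageLevel.DISK_ONLY)

// reduceByKey pre-aggregates within each partition before shuffling,
// which is what keeps the low-cardinality case cheap.
val counts = events.reduceByKey(_ + _)
```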
