
I'm getting an out-of-memory error in a loop while adding elements to an immutable set. There are already a lot of objects in the set, and I guess it's consuming a lot of memory. My understanding is that when adding an element to an immutable collection, Scala will first copy the existing collection into a new set, add the element to the new set, and return this new set.

So suppose my JVM memory is 500 MB and the set is consuming 400 MB. Now, before adding a new element, Scala tries to copy the old set into a new set (which I think would consume another 400 MB); at this point it has already exceeded the JVM memory (800 MB consumed in total), and hence it throws an out-of-memory error. The code looks a little like this:

```scala
private def getNewCollection(myMuttableSet: Set[MyType]): Set[MyType] = {
  myMuttableSet.flatMap { c =>
    // this method returns a large collection, so during the loop the result grows exponentially
    val returnedSet = doSomeCalculationsAndreturnASet
    if (returnedSet.isEmpty) Set.empty[MyType]
    else doSomeCalculationsAndreturnASet + MyType(constArg1, constArg2) // MyType is a case class
  }
}
```

Kindly advise if my understanding is correct.

whysoseriousson
  • Could you include your loop code in your question? – hasumedic Nov 19 '18 at 14:56
  • Your understanding is correct – Dima Nov 19 '18 at 15:04
  • Yes you're correct, but it would be nice to see your code, because if you're doing something like: `var set = Set.empty[Int]; for (i <- 1 to 1000) { set = set + i}`, then, IMHO, would be better to just use a [mutable set](https://www.scala-lang.org/api/current/scala/collection/mutable/Set.html). – Luis Miguel Mejía Suárez Nov 19 '18 at 15:18
  • Preference between `mutable val` vs `immutable var` should depend in part on visibility. See https://stackoverflow.com/a/11386867/3226045 for an explanation on referential transparency. – Shane Perry Nov 19 '18 at 20:33
  • added the algo that is used in the program – whysoseriousson Nov 20 '18 at 06:57
  • Why are you calling `doSomeCalculationsAndreturnASet` if you already have its result stored as `returnedSet`? This might or might not be optimized away by the compiler or JIT. I think this might be your memory leak. Set concatenation doesn't copy over the whole Set as you suggested. Please check my answer for details. – Aivean Nov 20 '18 at 08:22

2 Answers


It is not quite as simple as that, because it depends on the size of the elements in the Set.

Creating a new Set is a shallow operation and does not copy the elements in the set, it just creates a new wrapper (typically a hash table of some sort) pointing to the same objects.

If you have a small set of large objects then duplicating that set might not take much storage, because the objects will be shared between the two sets. Most of the memory is used by the objects in the set, and these do not need to be copied to create a new set. So your 400 MB might become 450 MB and fit within the memory limit.

If you have a large set of small objects then duplicating that set may double the storage. Most of the memory is used in the Set itself and can't be shared between the original set and the copy. In this case your 400 MB could easily become close to 800 MB.
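To make the sharing in the first case concrete, here is a minimal sketch (`BigObject` is a hypothetical class, not from the question):

```scala
// Hypothetical example: the elements are shared, only the Set wrapper is new
case class BigObject(payload: Array[Byte])

val big = BigObject(new Array[Byte](10 * 1024 * 1024)) // ~10 MB element
val s1: Set[BigObject] = Set(big)
val s2 = s1 + BigObject(Array.emptyByteArray) // add a tiny second element

// The 10 MB element is not copied: both sets point to the same instance
println(s2.find(_.payload.nonEmpty).get eq big) // true
```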

Since you are running out of memory and you say there are a lot of objects, it sounds like this is the problem, but we would need to see the code to tell for sure.

Tim

Now, before adding a new element, Scala tries to copy the old set into a new set (which I think would consume another 400 MB); at this point it has already exceeded the JVM memory

This is not correct.

Immutable collections in Scala (including Set) are implemented as persistent data structures, which usually have a property called "structural sharing". That means that when the structure is updated, it is not fully copied; instead, most of it is reused, with only a relatively small part actually re-created from scratch.

The easiest example to illustrate this is List, which is implemented as a singly-linked list, with each reference pointing to the head of a list.

For example, you have the following code:

```scala
val a = List(3, 2, 1)
val b = 4 :: a
val c = 5 :: b
```

Although the three lists combined have 3 + 4 + 5 = 12 elements in total, they physically share the nodes, and there are only 5 List nodes.

```
5 →  4  →  3 →  2  → 1
↑    ↑     ↑
c    b     a
```
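You can check the sharing directly with reference equality (`eq`):

```scala
val a = List(3, 2, 1)
val b = 4 :: a
val c = 5 :: b

println(b.tail eq a) // true: b reuses a's nodes rather than copying them
println(c.tail eq b) // true
```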

A similar principle applies to Set. Set in Scala is implemented as a hash trie. I won't go into the specifics of a trie; just think of it as a tree with a high branching factor. Now, when that tree is updated, it is not copied completely. Only the nodes on the path from the tree root to the new/updated node are copied.

For the hash trie, the depth of the tree cannot be more than 7 levels. So, when updating a Set in Scala, you are looking at allocating at most on the order of 7 × 32 references (7 levels max, where each node is roughly an array of 32 entries) in the worst case, regardless of the Set's size.
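As a rough illustration (an informal sketch; the exact timing depends on your JVM and warm-up, and is not from the answer):

```scala
val big: Set[Int] = (1 to 1000000).toSet // a large immutable Set

val t0 = System.nanoTime()
val big2 = big + -1 // rebuilds only the root-to-leaf path, not the whole trie
val t1 = System.nanoTime()

println(s"adding one element took ${(t1 - t0) / 1000} µs")
println((big.size, big2.size)) // (1000000, 1000001): the original is untouched
```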


Looking at your code, you have the following things in memory:

  1. myMuttableSet is present until getNewCollection returns
  2. myMuttableSet.flatMap creates a mutable buffer underneath. Also, after flatMap is done, buffer.result will copy the content of the mutable buffer over to an immutable set. So there is actually a brief moment when two result sets exist.
  3. on every step of flatMap, returnedSet also retains memory.

Side note: why are you calling doSomeCalculationsAndreturnASet again if you already have its result cached in returnedSet? Could that be the root of the problem? See the sketch below.
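If that second call is unintended, a minimal sketch of the fix (reusing the names from the question) could look like this:

```scala
private def getNewCollection(myMuttableSet: Set[MyType]): Set[MyType] =
  myMuttableSet.flatMap { c =>
    val returnedSet = doSomeCalculationsAndreturnASet
    if (returnedSet.isEmpty) Set.empty[MyType]
    else returnedSet + MyType(constArg1, constArg2) // reuse returnedSet instead of recomputing
  }
```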

So, at any given point in time you have in memory (whichever is larger):

  • myMuttableSet + mutable result set buffer + returnedSet + (another?) result of doSomeCalculationsAndreturnASet
  • myMuttableSet + mutable result set buffer + immutable result set

To conclude, whatever your problems with memory are, adding an element to the Set is most probably not the culprit. My suggestion would be to suspend your program in a debugger and use a profiler (such as VisualVM) to take heap dumps at different stages.

Aivean