Now for before adding new element Scala tries to copy old set into a new set (which I think would again consume 400mb again) now at this step,
This is not correct.
Immutable collections in Scala (including Sets) are implemented as persistent data structures, which usually have a property called "structural sharing". That means that when the structure is updated, it is not fully copied; most of it is reused, and only a relatively small part is re-created from scratch.
The easiest example to illustrate this is List, which is implemented as a singly linked list, with the root pointing to the head.
For example, you have the following code:
val a = List(3,2,1)
val b = 4 :: a
val c = 5 :: b
Although the three lists combined have 3 + 4 + 5 = 12 elements in total, they physically share the nodes, and there are only 5 List nodes.
5 → 4 → 3 → 2 → 1
↑   ↑   ↑
c   b   a
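The sharing in the diagram can be checked directly with reference equality. This is a minimal sketch; the object name ListSharingDemo is only for illustration.

```scala
object ListSharingDemo {
  val a: List[Int] = List(3, 2, 1)
  val b: List[Int] = 4 :: a // allocates one new node that points at a
  val c: List[Int] = 5 :: b // allocates one new node that points at b

  def main(args: Array[String]): Unit = {
    // `eq` compares object references, not contents.
    println(b.tail eq a)      // b's tail is the very same object as a
    println(c.tail eq b)      // c's tail is the very same object as b
    println(c.tail.tail eq a) // two steps down from c we reach a itself
  }
}
```

All three comparisons hold, so prepending allocated exactly one node per list and copied nothing.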
A similar principle applies to Set. Set in Scala is implemented as a HashTrie. I won't go into the specifics of a Trie; just think of it as a tree with a high branching factor. When that tree is updated, it is not copied completely. Only the nodes on the path from the tree root to the new/updated node are copied.
For the HashTrie, the depth of the tree cannot be more than 7 levels. So when updating a Set in Scala, you are looking at a memory allocation proportional to 7 * 32 (7 levels max, each node being, roughly speaking, an array of 32) in the worst case. That bound is a constant, regardless of the Set size.
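As a rough illustration, the 7-level bound follows from a 32-bit hash code being consumed 5 bits (a 32-way branch) per level. The sketch below also shows that the old set stays intact and usable after an update. It only demonstrates the semantics; it does not measure memory, and the object and value names are made up for the example.

```scala
object SetUpdateDemo {
  // Assumption for the arithmetic: 32-bit hash codes, 5 bits used per trie level.
  val hashBits: Int = 32
  val bitsPerLevel: Int = 5
  val maxDepth: Int = (hashBits + bitsPerLevel - 1) / bitsPerLevel // ceiling of 32 / 5

  val big: Set[Int] = (1 to 1000000).toSet
  val bigger: Set[Int] = big + 0 // a new set; `big` itself is not modified

  def main(args: Array[String]): Unit = {
    println(maxDepth)        // 7
    println(big.size)        // 1000000
    println(bigger.size)     // 1000001
    println(big.contains(0)) // false
  }
}
```

Both big and bigger are complete, independent-looking sets, yet the second one was produced without duplicating the million existing elements.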
Looking at your code, you have the following things in memory:
- myMuttableSet is present until getNewCollection returns.
- myMuttableSet.flatMap creates a mutable buffer underneath. Also, after flatMap is done, buffer.result will copy the contents of the mutable buffer over to an immutable set, so there is actually a brief moment when two sets exist.
- On every step of flatMap, returnedSet also retains memory.
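To make the list above concrete, here is a hand-written approximation of what flatMap does. It is a sketch, not the library's actual implementation: calc is a hypothetical stand-in for doSomeCalculationsAndreturnASet, and the builder used internally differs between Scala versions.

```scala
object FlatMapBuilderSketch {
  // Hypothetical stand-in for the question's doSomeCalculationsAndreturnASet.
  def calc(x: Int): Set[Int] = Set(x, x * 10)

  // Approximation of flatMap: accumulate into a mutable builder,
  // then turn the builder into the immutable result at the end.
  def flatMapByHand(input: Set[Int]): Set[Int] = {
    val builder = Set.newBuilder[Int] // the mutable buffer
    for (x <- input) {
      val returnedSet = calc(x)       // reachable only during this iteration
      builder ++= returnedSet
    }
    builder.result()                  // the immutable result set
  }

  def main(args: Array[String]): Unit = {
    val input = Set(1, 2, 3)
    // Same elements as the built-in flatMap.
    println(flatMapByHand(input) == input.flatMap(calc))
  }
}
```

While the loop runs, the input set, the builder, and the current returnedSet are all reachable at once, which matches the items listed above.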
Side note: why are you calling doSomeCalculationsAndreturnASet again if you already have its result cached in returnedSet? Could that be the root of the problem?
So, at any given point in time you have in memory (whichever is larger):
- myMuttableSet + mutable result set buffer + returnedSet + (another?) result of doSomeCalculationsAndreturnASet
- myMuttableSet + mutable result set buffer + immutable result set
To conclude, whatever your memory problems are, adding an element to the Set is most probably not the culprit. My suggestion would be to suspend your program in a debugger and use a profiler (such as VisualVM) to take heap dumps at different stages.