I have just started using Scala and have the following code to create an IndexedSeq of dummy data called out. The dummy data consists of 20000 tuples, each containing a 36-character unique identifier and a list of 1000 floats.
import scala.util.Random
def uuid = java.util.UUID.randomUUID.toString
def generateRandomList(size: Int): List[Float] = {
  List.fill(size)(Random.nextFloat)
}
val numDimensions = 1000
val numberToWrite = 20000
val out = for (i <- 1 to numberToWrite) yield {
  val randomList = generateRandomList(numDimensions)
  (uuid, randomList) // trying tuples instead
}
But when I run the last statement (just by copying and pasting into the Scala shell) I get the following error:
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.lang.Float.valueOf(Float.java:433)
at scala.runtime.BoxesRunTime.boxToFloat(BoxesRunTime.java:73)
at $anonfun$generateRandomArray$1.apply(<console>:14)
at scala.collection.generic.GenTraversableFactory.fill(GenTraversableFactory.scala:90)
at .generateRandomArray(<console>:14)
at $anonfun$1.apply(<console>:17)
at $anonfun$1.apply(<console>:16)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.Range.foreach(Range.scala:160)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
... 20 elided
This is described as a Java error that is thrown when the JVM spends most of its time doing garbage collection (GC) [1].
According to [2], a 36-character string should take about 112 bytes. A Float takes 4 bytes, and I have 1000 of them in my inner list, so about 4000 bytes in total. Ignoring the list and tuple overhead, each element of my out IndexedSeq should therefore be roughly 4200 bytes. Having 20000 of them means about 84e6 bytes overall.
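To make the arithmetic explicit, here is the same back-of-the-envelope estimate as a small snippet. The names bytesPerUuid and bytesPerFloat are only illustrative, and the 112-byte figure is the one taken from [2]; this deliberately ignores any list, boxing, or tuple overhead, exactly as the estimate above does.

```scala
// Rough estimate only: ignores List, boxing, and tuple overhead.
val bytesPerUuid = 112L   // figure for a 36-character String, from [2]
val bytesPerFloat = 4L    // size of a primitive Float
val numDimensions = 1000
val numberToWrite = 20000

// 112 + 4 * 1000 = 4112 bytes per element (rounded up to ~4200 above)
val bytesPerElement = bytesPerUuid + bytesPerFloat * numDimensions

// 4112 * 20000 = 82240000 bytes, i.e. on the order of 84e6
val totalBytes = bytesPerElement * numberToWrite
```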
With this in mind after the exception I run this (taken from [3]):
scala> val heapSize = Runtime.getRuntime().totalMemory(); // Get current size of heap in bytes
heapSize: Long = 212860928
scala> val heapMaxSize = Runtime.getRuntime().maxMemory(); // Get maximum size of heap in bytes. The heap cannot grow beyond this size. Any attempt will result in an OutOfMemoryError.
heapMaxSize: Long = 239075328
scala> val heapFreeSize = Runtime.getRuntime().freeMemory(); // Get amount of free memory within the heap in bytes. This size will increase after garbage collection and decrease as new objects are created.
heapFreeSize: Long = 152842176
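From these three readings, the memory actually in use is the total heap minus the free part, which for the numbers above is 212860928 - 152842176 = 60018752 bytes. A minimal sketch of that calculation follows; the helper name toMiB is just something I made up for readability.

```scala
val rt = Runtime.getRuntime

// Memory currently occupied by objects (live or not yet collected)
val usedBytes = rt.totalMemory - rt.freeMemory

// Hypothetical helper: convert a byte count to mebibytes
def toMiB(bytes: Long): Double = bytes / (1024.0 * 1024.0)

println(f"used: ${toMiB(usedBytes)}%.1f MiB of max ${toMiB(rt.maxMemory)}%.1f MiB")
```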
Although my maximum heap size appears to be greater than the rough amount of memory I think I need, I tried increasing the heap size ([4]) via ./scala -J-Xmx2g. This solves my problem, but it would be good to know whether there is a better way to create this random data that avoids having to increase the memory available to the JVM.
I therefore have these three questions, which I would be grateful if someone could answer:
1. When does garbage collection occur in Scala, and in particular in the Scala shell? In my commands above, what is there that can be collected, and so why is the GC being called? (Sorry, this second part probably shows my lack of knowledge about the GC.)
2. Are my rough calculations of the amount of memory I am using approximately valid? (I expect a bit more overhead for the list and tuples, but I am assuming it is relatively small.) If so, why do I run out of memory when my max heap size (239e6 bytes) should cover this? And if not, what extra memory am I using?
3. Is there a better way to create random data for this? For context, I am just trying to create some dummy data that I can parallelise into Spark (using sc.parallelize) and then play around with. (To get it to work when I moved to trying it in Spark, I increased the driver memory by setting spark.driver.memory 2g in my Spark conf rather than using the -J-Xmx2g option above.)
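For reference, one variant I could try is sketched below. The stack trace shows every value going through boxToFloat and Float.valueOf, so each element of a List[Float] is a separate boxed object; Array[Float], by contrast, is backed by a primitive float[] on the JVM. The use of Array.fill and Vector.fill here is my own choice, and I have not verified that this fits in the default heap.

```scala
import scala.util.Random

def uuid: String = java.util.UUID.randomUUID.toString

// Array[Float] is stored as a primitive float[], so elements are not boxed
def generateRandomArray(size: Int): Array[Float] =
  Array.fill(size)(Random.nextFloat())

val numDimensions = 1000
val numberToWrite = 20000

// Same shape as before: an IndexedSeq of (identifier, values) tuples
val out: IndexedSeq[(String, Array[Float])] =
  Vector.fill(numberToWrite)((uuid, generateRandomArray(numDimensions)))
```

The raw float payload is still 20000 * 1000 * 4 bytes, so this only removes the per-element object overhead, not the data itself.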
Thanks for your help!