
I have just started using Scala and have the following code to create an IndexedSeq of dummy data called out. The dummy data consists of 20000 tuples, each containing a 36-character unique identifier and a list of 1000 floats.

import scala.util.Random

def uuid = java.util.UUID.randomUUID.toString

def generateRandomList(size: Int): List[Float] = {
    List.fill(size)(Random.nextFloat)
}

val numDimensions = 1000
val numberToWrite = 20000

val out = for ( i <- 1 to numberToWrite) yield {
      val randomList = generateRandomList(numDimensions)
      (uuid, randomList)  // trying tuples instead
}

But when I run the last statement (just by copying and pasting into the Scala shell) I get the following error:

java.lang.OutOfMemoryError: GC overhead limit exceeded
  at java.lang.Float.valueOf(Float.java:433)
  at scala.runtime.BoxesRunTime.boxToFloat(BoxesRunTime.java:73)
  at $anonfun$generateRandomArray$1.apply(<console>:14)
  at scala.collection.generic.GenTraversableFactory.fill(GenTraversableFactory.scala:90)
  at .generateRandomArray(<console>:14)
  at $anonfun$1.apply(<console>:17)
  at $anonfun$1.apply(<console>:16)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.immutable.Range.foreach(Range.scala:160)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.AbstractTraversable.map(Traversable.scala:104)
  ... 20 elided

This is explained as a Java error that is thrown when the JVM spends most of its time doing garbage collection (GC) while reclaiming very little memory [1].

According to [2], a 36-character string should take about 112 bytes. A Float takes 4 bytes, and I have 1000 of them in my inner list, so about 4000 bytes in total. Ignoring the list and tuple overhead, each element of my out IndexedSeq should then be roughly 4200 bytes, so 20000 of them come to about 84e6 bytes overall.
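The estimate above can be redone with the list and boxing overhead included. This is only a sketch: the per-object sizes are assumptions for a 64-bit HotSpot JVM with compressed object pointers (12-byte object headers, 4-byte references, 8-byte alignment), and the 112-byte string figure is the one quoted from [2]. Other JVMs or settings will give different numbers.

```scala
// Assumed per-object sizes (64-bit HotSpot, compressed oops); estimates only.
val boxedFloatBytes = 16L   // 12-byte header + 4-byte float payload
val consCellBytes   = 24L   // 12-byte header + head ref (4) + tail ref (4), padded to a multiple of 8
val stringBytes     = 112L  // figure quoted in the question for a 36-character string

val numDimensions = 1000
val numberToWrite = 20000

// Each List[Float] element needs one cons cell plus one boxed java.lang.Float.
val bytesPerList  = numDimensions * (boxedFloatBytes + consCellBytes)
val bytesPerTuple = bytesPerList + stringBytes   // tuple and IndexedSeq overhead ignored
val totalBytes    = numberToWrite * bytesPerTuple

println(s"per list: $bytesPerList bytes, total: ${totalBytes / 1000000} MB")
```

Under these assumptions each list element costs about 40 bytes rather than 4, which puts the whole structure at roughly 800e6 bytes, well above the 239e6-byte maximum heap reported in the question.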

With this in mind after the exception I run this (taken from [3]):

scala> val heapSize = Runtime.getRuntime().totalMemory() // current size of the heap in bytes
heapSize: Long = 212860928

scala> val heapMaxSize = Runtime.getRuntime().maxMemory() // maximum size of the heap in bytes; the heap cannot grow beyond this, and any attempt results in an OutOfMemoryError
heapMaxSize: Long = 239075328

scala> val heapFreeSize = Runtime.getRuntime().freeMemory() // free memory within the heap in bytes; increases after garbage collection and decreases as new objects are created
heapFreeSize: Long = 152842176
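The three readings are easier to interpret when combined into a single "used" figure, since memory in use is the total heap minus the free part. A minimal sketch, where usedHeapMB and maxHeapMB are hypothetical helper names that do not appear in the original post:

```scala
// Hypothetical helpers (not from the original post), built on the same
// Runtime calls used in the session above.
def usedHeapMB(): Long = {
  val rt = Runtime.getRuntime
  (rt.totalMemory() - rt.freeMemory()) / (1024L * 1024L)
}

def maxHeapMB(): Long = Runtime.getRuntime.maxMemory() / (1024L * 1024L)

println(s"used: ${usedHeapMB()} MB of max ${maxHeapMB()} MB")
```

With the figures printed above, total minus free is about 60e6 bytes, so these readings were taken after the failed allocation had already been released, not at the moment the error was thrown.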

Although my maximum heap size appears to be greater than the rough amount of memory I think I need, I tried increasing the heap size ([4]) via ./scala -J-Xmx2g. This solves my problem, but it would be good to know whether there is a better way to create this random data that avoids having to increase the memory available to the JVM. I therefore have three questions, which I would be grateful if someone could answer:

  1. When does garbage collection occur in Scala, and in particular in the Scala shell? In my commands above, what is there that can be collected, and so why is the GC being called? (Sorry, this second part probably shows my lack of knowledge about the GC.)

  2. Are my rough calculations of the amount of memory I am using approximately valid? (I expect a bit more overhead for the list and tuples, but I am assuming it is relatively small.) If so, why do I run out of memory when my max heap size (239e6 bytes) should cover this? And if not, what extra memory am I using?

  3. Is there a better way to create random data for this? For context, I am just trying to create some dummy data that I can parallelise into Spark (using sc.parallelize) and then play around with. (To get it to work in Spark, I increased the driver memory by setting spark.driver.memory 2g in my Spark conf rather than using the -J-Xmx2g option above.)

Thanks for your help!

Links

  1. Error java.lang.OutOfMemoryError: GC overhead limit exceeded
  2. How much memory does a string use in Java 8?
  3. How to view the current heap size that an application is using?
  4. Increase JVM heap size for Scala?
jay--bee
  • +1 great question, with a lot of helpful research behind it; hard to believe it hasn't been upvoted and that it has only a single answer – doug Jan 30 '19 at 03:13

1 Answer


To answer the REPL-specific part:

https://issues.scala-lang.org/browse/SI-4331

Folks doing big allocations usually prefer Array and Buffer.

Note that there's overhead in List, including boxing the primitive values.
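As an illustration of the Array suggestion, here is a hedged sketch of the same dummy data with each inner List[Float] replaced by an Array[Float]. The helper name randomFloatArray is made up for this example, and whether the result fits in a given heap still depends on the JVM settings.

```scala
import scala.util.Random

def uuid: String = java.util.UUID.randomUUID.toString

// Hypothetical helper: Array[Float] is backed by a primitive float[],
// so each element takes 4 bytes with no boxed Float and no cons cell.
def randomFloatArray(size: Int): Array[Float] =
  Array.fill(size)(Random.nextFloat())

val numDimensions = 1000
val numberToWrite = 20000

// Vector is an IndexedSeq; each entry holds one id and one primitive array.
val out: IndexedSeq[(String, Array[Float])] =
  Vector.fill(numberToWrite)((uuid, randomFloatArray(numDimensions)))
```

The float payload here is about 20000 * 4000 = 80e6 bytes plus a small header per array, which is close to the estimate in the question. One trade-off is that arrays are mutable and compare by reference, so two arrays with equal contents are not == to each other.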

JVM heap is managed in pools, which you can size relative to each other. But generally speaking:

scala> var x = new Array[Byte](20000000 * 4)
x: Array[Byte] = Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
scala> x = null
x: Array[Byte] = null

scala> x = new Array[Byte](20000000 * 4)
x: Array[Byte] = [B@475530b9

scala> x = null
x: Array[Byte] = null

scala> x = new Array[Byte](20000000 * 4)
java.lang.OutOfMemoryError: Java heap space
  ... 32 elided
som-snytt