
I'm having issues reading in a file in Scala - still a bit of a Scala noob, I'm afraid. I have to read a file which is roughly 500 MB, split each line on a delimiter, and add the results to a map for later lookups.

My code is like this:

val inF = args(0)
for(lines: String <- scala.io.Source.fromFile(inF).getLines) {
    val xs = lines.split(",")
    // do some work on the result
    // update a hashmap
}

Within a few seconds, I get an error:

java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.util.ArrayList.subList(ArrayList.java:955)
        at java.lang.String.split(String.java:2311)
        at java.lang.String.split(String.java:2355)
        at Main$$anon$1$$anonfun$5.apply(cosine.scala:41)
        at Main$$anon$1$$anonfun$5.apply(cosine.scala:37)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        at Main$$anon$1.<init>(cosine.scala:37)
        at Main$.main(cosine.scala:1)
        at Main.main(cosine.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at scala.tools.nsc.util.ScalaClassLoader$$anonfun$run$1.apply(ScalaClassLoader.scala:71)
        at scala.tools.nsc.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31)
        at scala.tools.nsc.util.ScalaClassLoader$URLClassLoader.asContext(ScalaClassLoader.scala:139)
        at scala.tools.nsc.util.ScalaClassLoader$class.run(ScalaClassLoader.scala:71)
        at scala.tools.nsc.util.ScalaClassLoader$URLClassLoader.run(ScalaClassLoader.scala:139)
        at scala.tools.nsc.CommonRunner$class.run(ObjectRunner.scala:28)
        at scala.tools.nsc.ObjectRunner$.run(ObjectRunner.scala:45)
        at scala.tools.nsc.CommonRunner$class.runAndCatch(ObjectRunner.scala:35)
        at scala.tools.nsc.ObjectRunner$.runAndCatch(ObjectRunner.scala:45)
        at scala.tools.nsc.ScriptRunner.scala$tools$nsc$ScriptRunner$$runCompiled(ScriptRunner.scala:171)
        at scala.tools.nsc.ScriptRunner$$anonfun$runScript$1.apply(ScriptRunner.scala:188)
        at scala.tools.nsc.ScriptRunner$$anonfun$runScript$1.apply(ScriptRunner.scala:188)
        at scala.tools.nsc.ScriptRunner$$anonfun$withCompiledScript$1.apply$mcZ$sp(ScriptRunner.scala:157)
        at scala.tools.nsc.ScriptRunner$$anonfun$withCompiledScript$1.apply(ScriptRunner.scala:131)
        at scala.tools.nsc.ScriptRunner$$anonfun$withCompiledScript$1.apply(ScriptRunner.scala:131)
        at scala.tools.nsc.util.package$.trackingThreads(package.scala:51)
        at scala.tools.nsc.util.package$.waitingForThreads(package.scala:35)
        at scala.tools.nsc.ScriptRunner.withCompiledScript(ScriptRunner.scala:130)

Any help would be greatly appreciated!

UPDATE: more information on the problem:

I want to form a sparse vector of (variable2 -> value) pairs for each variable1. I then need to compare the similarity between the sparse vectors for each pair of variable1 values, which could be people or unique IDs.

My CSV looks like this:

variable1,variable2,rating
"Alice","A",0.9
"Alice","B",0.8
"Alice","C",0.9
"Bob","A",0.5
"Bob","B",0.7
"Bob","D",0.9

My whole code is like this (minus the similarity function):

val m = new scala.collection.mutable.HashMap[String, scala.collection.mutable.HashMap[String, Double]]

for (lines: String <- scala.io.Source.fromFile(inF).getLines) {
    lines match {
        case "variable1,variable2,rating" => println("header skipping")
        case _ =>
            val xs = lines.split(",")
            val var1 = xs(0)
            val var2 = xs(1)
            val rat = xs(2).toDouble
            m.get(var1) match {
                case Some(x) => x.update(var2, rat)
                                m.update(var1, x)
                case None    => val tmpMap = new scala.collection.mutable.HashMap[String, Double]
                                tmpMap.update(var2, rat)
                                m.update(var1, tmpMap)
            }
    }
}

val data = m.par

val results = for {
    (var1, xs) <- data
    (var2, ys) <- m
    if (var1 < var2)
} yield( (var1, var2, similarity(xs, ys)))

So I have to find and compare pairs of (variable 1, sparse vector), and get the similarity between them.
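
For context, the similarity function is essentially a cosine similarity over the two sparse maps, along these lines (a simplified sketch, not my exact code):

// Sketch only: cosine similarity between two sparse vectors stored as maps
def similarity(xs: scala.collection.Map[String, Double],
               ys: scala.collection.Map[String, Double]): Double = {
    // dot product over the keys of xs; keys missing from ys count as 0.0
    val dot   = xs.foldLeft(0.0) { case (acc, (k, v)) => acc + v * ys.getOrElse(k, 0.0) }
    val normX = math.sqrt(xs.values.map(v => v * v).sum)
    val normY = math.sqrt(ys.values.map(v => v * v).sum)
    if (normX == 0.0 || normY == 0.0) 0.0 else dot / (normX * normY)
}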

Guy Needham

1 Answer


java.lang.OutOfMemoryError: GC overhead limit exceeded means the JVM garbage collector is consuming almost all of the available CPU time (by default, more than 98% of it) while recovering very little heap. This usually indicates you are creating a lot of garbage made up of many small, short-lived objects. In your case I expect you're running Java 7 or later, where String.split copies every substring rather than sharing the original line's backing array, so the heap fills up with throwaway strings. // update a hashmap is pretty suspicious as well. What exactly do you do there?
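
If building the map really is what blows up, one thing worth trying is to cut the per-line allocations down, for example with getOrElseUpdate so the inner map is only created when the key is new (untested sketch, reusing your inF and column layout; the startsWith check is just a cheap alternative to your exact header match):

val m = new scala.collection.mutable.HashMap[String, scala.collection.mutable.HashMap[String, Double]]

for (line <- scala.io.Source.fromFile(inF).getLines) {
    if (!line.startsWith("variable1")) {              // cheap header skip
        val xs = line.split(",", 3)                   // never produce more than 3 fields per line
        // single lookup; allocates the inner map only the first time a var1 is seen
        val inner = m.getOrElseUpdate(xs(0), new scala.collection.mutable.HashMap[String, Double])
        inner.update(xs(1), xs(2).toDouble)
    }
}

The stack trace also shows you're running this through the Scala script runner, which uses the default JVM heap size; simply giving it more memory (for example scala -J-Xmx2g cosine.scala) may be enough on its own, since a 500 MB CSV expanded into nested HashMaps of boxed Doubles can easily need several times that much heap.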

  • `val data = m.par` causes GC overhead since it involves a lot of boxing of the Double values in the map of maps. I can't figure out what you're trying to achieve. – Alexander Lomov Mar 14 '15 at 14:38
  • The GC error occurs while reading the file - if I run the code with print statements I can see I don't get beyond the for loop. – Guy Needham Mar 14 '15 at 15:27
  • What's your Java runtime version? If it's 1.6 or lower, the GC overhead might be caused by heavy permanent-generation usage. – Alexander Lomov Mar 16 '15 at 00:15