I'm having trouble reading a file in Scala (still a bit of a Scala noob, I'm afraid). I have to read a file of roughly 500 MB, split each line on a delimiter, and add the results to a map for later lookups.
My code is like this:
val inF = args(0)
for (line <- scala.io.Source.fromFile(inF).getLines()) {
  val xs = line.split(",")
  // do some work on the result
  // update a hashmap
}
Within a few seconds, I get an error:
java.lang.OutOfMemoryError: GC overhead limit exceeded
> at java.util.ArrayList.subList(ArrayList.java:955)
> at java.lang.String.split(String.java:2311)
> at java.lang.String.split(String.java:2355)
> at Main$$anon$1$$anonfun$5.apply(cosine.scala:41)
> at Main$$anon$1$$anonfun$5.apply(cosine.scala:37)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at Main$$anon$1.<init>(cosine.scala:37)
> at Main$.main(cosine.scala:1)
> at Main.main(cosine.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at scala.tools.nsc.util.ScalaClassLoader$$anonfun$run$1.apply(ScalaClassLoader.scala:71)
> at scala.tools.nsc.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31)
> at scala.tools.nsc.util.ScalaClassLoader$URLClassLoader.asContext(ScalaClassLoader.scala:139)
> at scala.tools.nsc.util.ScalaClassLoader$class.run(ScalaClassLoader.scala:71)
> at scala.tools.nsc.util.ScalaClassLoader$URLClassLoader.run(ScalaClassLoader.scala:139)
> at scala.tools.nsc.CommonRunner$class.run(ObjectRunner.scala:28)
> at scala.tools.nsc.ObjectRunner$.run(ObjectRunner.scala:45)
> at scala.tools.nsc.CommonRunner$class.runAndCatch(ObjectRunner.scala:35)
> at scala.tools.nsc.ObjectRunner$.runAndCatch(ObjectRunner.scala:45)
> at scala.tools.nsc.ScriptRunner.scala$tools$nsc$ScriptRunner$$runCompiled(ScriptRunner.scala:171)
> at scala.tools.nsc.ScriptRunner$$anonfun$runScript$1.apply(ScriptRunner.scala:188)
> at scala.tools.nsc.ScriptRunner$$anonfun$runScript$1.apply(ScriptRunner.scala:188)
> at scala.tools.nsc.ScriptRunner$$anonfun$withCompiledScript$1.apply$mcZ$sp(ScriptRunner.scala:157)
> at scala.tools.nsc.ScriptRunner$$anonfun$withCompiledScript$1.apply(ScriptRunner.scala:131)
> at scala.tools.nsc.ScriptRunner$$anonfun$withCompiledScript$1.apply(ScriptRunner.scala:131)
> at scala.tools.nsc.util.package$.trackingThreads(package.scala:51)
> at scala.tools.nsc.util.package$.waitingForThreads(package.scala:35)
> at scala.tools.nsc.ScriptRunner.withCompiledScript(ScriptRunner.scala:130)
Any help would be greatly appreciated!
UPDATE: more information on the problem:
I want to build a sparse vector of the form (variable2 -> value) for each distinct variable1. I then need to compute the similarity between the sparse vectors for every pair of variable1 values, which could be people or unique IDs.
My CSV looks like this:
variable1,variable2,value
"Alice","A",0.9
"Alice","B",0.8
"Alice","C",0.9
"Bob","A",0.5
"Bob","B",0.7
"Bob","D",0.9
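To make the target structure concrete, here is a minimal sketch that builds it for just the sample rows above. The names `sampleRows`, `unquote`, and `vectors` are made up for this illustration, and stripping the surrounding quotes from the keys is an assumption (my real code keeps them):

```scala
// Sample rows copied from the CSV above (header omitted).
val sampleRows = Seq(
  "\"Alice\",\"A\",0.9",
  "\"Alice\",\"B\",0.8",
  "\"Alice\",\"C\",0.9",
  "\"Bob\",\"A\",0.5",
  "\"Bob\",\"B\",0.7",
  "\"Bob\",\"D\",0.9"
)

// Hypothetical helper: drop one leading and one trailing double quote, if present.
def unquote(s: String): String = s.stripPrefix("\"").stripSuffix("\"")

// variable1 -> (variable2 -> value), i.e. one sparse vector per variable1.
val vectors: Map[String, Map[String, Double]] =
  sampleRows
    .map(_.split(","))
    .groupBy(cols => unquote(cols(0)))
    .map { case (var1, rows) =>
      var1 -> rows.map(cols => unquote(cols(1)) -> cols(2).toDouble).toMap
    }
```

This is only meant to show the shape of the result on six rows; it is not how the 500 MB file is processed.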
My whole code is like this (minus the similarity function):
import scala.collection.mutable

// variable1 -> (variable2 -> value)
val m = mutable.HashMap.empty[String, mutable.HashMap[String, Double]]
for (line <- scala.io.Source.fromFile(inF).getLines()) {
  line match {
    // must match the header row of the CSV exactly
    case "variable1,variable2,value" => println("header skipping")
    case _ =>
      val xs = line.split(",")
      val var1 = xs(0)
      val var2 = xs(1)
      val rat = xs(2).toDouble
      // fetch the inner map for var1, creating it the first time var1 is seen
      m.getOrElseUpdate(var1, mutable.HashMap.empty[String, Double]).update(var2, rat)
  }
}
val data = m.par
val results = for {
  (var1, xs) <- data
  (var2, ys) <- m
  if var1 < var2
} yield (var1, var2, similarity(xs, ys))
So I need to take every pair of (variable1, sparse vector) entries and compute the similarity between them.
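For completeness, here is a rough sketch of what a `similarity` function over two sparse vectors could look like. It assumes cosine similarity (a guess based on the `cosine.scala` file name in the stack trace); the parameter types and the `alice`/`bob` test values are illustrative, not the exact code I am running:

```scala
import scala.collection.mutable

// Assumed stand-in for the omitted function: cosine similarity between two
// sparse vectors stored as key -> value maps. Missing keys count as 0.
def similarity(xs: collection.Map[String, Double],
               ys: collection.Map[String, Double]): Double = {
  // Walk the smaller map so the dot product costs O(min(|xs|, |ys|)) lookups.
  val (small, large) = if (xs.size <= ys.size) (xs, ys) else (ys, xs)
  val dot = small.iterator.map { case (k, v) => v * large.getOrElse(k, 0.0) }.sum
  val normX = math.sqrt(xs.valuesIterator.map(v => v * v).sum)
  val normY = math.sqrt(ys.valuesIterator.map(v => v * v).sum)
  // Define the similarity with an all-zero (or empty) vector as 0.
  if (normX == 0.0 || normY == 0.0) 0.0 else dot / (normX * normY)
}

// The two sample vectors from the CSV above, with quotes stripped from the keys.
val alice = mutable.HashMap("A" -> 0.9, "B" -> 0.8, "C" -> 0.9)
val bob = mutable.HashMap("A" -> 0.5, "B" -> 0.7, "D" -> 0.9)
val sim = similarity(alice, bob)
```

For the sample data only the shared keys A and B contribute to the dot product, so the result lies strictly between 0 and 1.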