
I usually handle statistics work in Python. For example, I have a large file containing tens of millions of ids:

$ cat report_ids | head
3788065
7950319
140494477
182851142
120757318
160033281
78087029
42591118
104363873
212143796
...

In IPython, the following lines always work well.

In [1]: lines = [line.strip() for  line in open('./report_ids').readlines()]

In [2]: from collections import Counter

In [3]: d = Counter(lines)

In [4]: d[lines[0]]
Out[4]: 9

When I try the same in Scala, an out-of-memory error occurs.

val lines = scala.io.Source.fromFile("./report_ids").getLines.toList
lines: List[String] = List(3788065, 7950319, 140494477, 182851142, 120757318, 160033281, 78087029, 42591118, 104363873, 212143796, 175644298, 112703123, 213308679, 109649718, 11947300, 214660563, 83402867, 162877289, 83030111, 78231639, 45129180, 11635655, 34778452, 46604760, 142519099, 213261965, 137812002, 167057636, 119258917, 212722777, 177979907, 13754217, 156769524, 40682536, 202195379, 91879046, 22766751, 6656279, 11972540, 76929862, 91616020, 110579570, 143849021, 27239477, 65146692, 142968764, 153891284, 182405787, 153038108, 50714639, 113386401, 96657813, 75908413, 32215626, 175000692, 154337083, 113754207, 165109267, 3788065, 42285876, 171004203, 109802388, 92956305, 46690091, 103638776, 15141632, 110579570, 120984867, 183167775, 86841540, 60465849, 27239477, 91760184, 213464...

scala> val g = lines.groupBy(e => e).mapValues(x => x.length)
java.lang.OutOfMemoryError: GC overhead limit exceeded
    at scala.collection.immutable.HashMap$HashTrieMap.updated0(HashMap.scala:328)
    at scala.collection.immutable.HashMap$HashTrieMap.updated0(HashMap.scala:326)
    at scala.collection.immutable.HashMap.$plus(HashMap.scala:57)
    at scala.collection.immutable.HashMap.$plus(HashMap.scala:36)
    at scala.collection.mutable.MapBuilder.$plus$eq(MapBuilder.scala:28)
    at scala.collection.mutable.MapBuilder.$plus$eq(MapBuilder.scala:24)
    at scala.collection.TraversableLike$$anonfun$groupBy$3.apply(TraversableLike.scala:334)
    at scala.collection.TraversableLike$$anonfun$groupBy$3.apply(TraversableLike.scala:333)
    at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
    at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
    at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
    at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
    at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
    at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
    at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
    at scala.collection.TraversableLike$class.groupBy(TraversableLike.scala:333)
    at scala.collection.AbstractTraversable.groupBy(Traversable.scala:105)
    at .<init>(<console>:8)
    at .<clinit>(<console>)
    at .<init>(<console>:7)
    at .<clinit>(<console>)
    at $print(<console>)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:734)
    at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:983)
    at scala.tools.nsc.interpreter.IMain.loadAndRunReq$1(IMain.scala:573)
    at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:604)
    at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:568)
    at scala.tools.nsc.interpreter.ILoop.reallyInterpret$1(ILoop.scala:760)

Then I tried a lazy val in Scala, but it still does not work.

scala> lazy val lines = scala.io.Source.fromFile("./report_ids").getLines.toList
lines: List[String] = <lazy>

scala> val g = lines.groupBy(e => e).mapValues(x => x.length)
java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOfRange(Arrays.java:2694)
    at java.lang.String.<init>(String.java:203)
    at java.io.BufferedReader.readLine(BufferedReader.java:349)
    at java.io.BufferedReader.readLine(BufferedReader.java:382)
    at scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:67)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
    at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:176)
    at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:45)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
    at scala.collection.AbstractIterator.to(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toList(TraversableOnce.scala:257)
    at scala.collection.AbstractIterator.toList(Iterator.scala:1157)
    at .lines$lzycompute(<console>:7)
    at .lines(<console>:7)
    at .<init>(<console>:8)
    at .<clinit>(<console>)
    at .<init>(<console>:7)
    at .<clinit>(<console>)
    at $print(<console>)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:734)
    at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:983)
    at scala.tools.nsc.interpreter.IMain.loadAndRunReq$1(IMain.scala:573)
    at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:604)
    at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:568)
    at scala.tools.nsc.interpreter.ILoop.reallyInterpret$1(ILoop.scala:760)
    at scala.tools.nsc.interpreter.ILoop.interpretStartingWith(ILoop.scala:805)

So how can I do the group-by in Scala the way I did in Python? Thanks.

zfz

1 Answer


I think you are doing something rather different in Python and Scala. Let's look at that first line:

lines = [line.strip() for  line in open('./report_ids').readlines()]

It looks to me that you are working with an iterable here (in Scala terms), not a real list. I might be wrong -- I don't work with Python often enough to remember -- but let's assume you are and see how you can get the same thing in Scala. You had this:

val lines = scala.io.Source.fromFile("./report_ids").getLines.toList

Now, Scala doesn't have a file iterable in the standard library, though I think Scala I/O comes with one. Here, you can do this to get an iterator (not the same thing):

val lines = scala.io.Source.fromFile("./report_ids").getLines

Just don't turn it into a List. :) Now, since this is an iterator, not an iterable, it would fail once you used it twice. So let's write it like this instead:

def lines = scala.io.Source.fromFile("./report_ids").getLines
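The difference between the val and the def can be seen without touching a file. The following is only a small sketch using an in-memory Iterator as a stand-in for getLines; the names once and fresh are made up for illustration.

```scala
// A val holds one iterator, which is exhausted after a single traversal.
val once = Iterator("a", "b", "c")
val firstPass = once.size   // walks all three elements
val secondPass = once.size  // nothing left to walk

// A def builds a new iterator on every call, so each traversal starts over.
def fresh = Iterator("a", "b", "c")
val passA = fresh.size
val passB = fresh.size
```

With Source.fromFile, each call of such a def reopens the file, which is why the def form can be traversed repeatedly.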

Now you can use "lines" multiple times. Sadly, you'll leak file descriptors -- for more serious I/O handling, look at a more serious I/O library such as Scalaz Stream or Scala I/O. Or use the loan pattern.
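For reference, the loan pattern mentioned above might look roughly like this. It is a minimal sketch: withLines and countLines are hypothetical helper names, not something from the standard library, and it assumes the result is fully computed before the block returns.

```scala
import scala.io.Source

// Hypothetical helper: opens the file, lends the line iterator to `f`,
// and closes the source afterwards, even if `f` throws.
def withLines[A](path: String)(f: Iterator[String] => A): A = {
  val source = Source.fromFile(path)
  try f(source.getLines())
  finally source.close()
}

// Example use: count the lines. The iterator must be consumed inside the
// block, because it is no longer usable once the source has been closed.
def countLines(path: String): Int = withLines(path)(_.size)
```

Returning the iterator itself from the block (for example withLines(path)(identity)) would defeat the purpose, since the file is already closed by the time the caller tries to read from it.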

Next, you replaced the Counter code with this:

val g = lines.groupBy(e => e).mapValues(x => x.length)

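That's going to be memory intensive, because groupBy first builds a collection holding every occurrence of each id, and only then does mapValues reduce each one to a length. Something like this should be much better: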

val g = scala.collection.mutable.HashMap.empty[String, Int] withDefaultValue 0
for (line <- lines) g(line) += 1
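To make the comparison with Python's Counter concrete, here is the same counting loop wrapped in a function and run on a tiny in-memory sample. This is only a sketch: countOccurrences is a made-up name, and the trim call is an assumption added to mirror the strip() in the Python version.

```scala
import scala.collection.mutable

// Streams over the lines once, keeping one Int per distinct id
// instead of a list of every occurrence.
def countOccurrences(lines: Iterator[String]): mutable.Map[String, Int] = {
  val counts = mutable.HashMap.empty[String, Int].withDefaultValue(0)
  for (line <- lines) counts(line.trim) += 1
  counts
}

// Small stand-in for the real file contents.
val sample = Iterator("3788065", "7950319", "3788065")
val counts = countOccurrences(sample)
val seenTwice = counts("3788065")  // id that appears twice in the sample
val unseen = counts("42")          // missing ids fall back to the default 0
```

Memory use then grows with the number of distinct ids rather than with the number of lines in the file.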

Since lines is an Iterator, you won't be able to do lines(0). For the first line, you could just do lines.next, or you can do lines.toStream(0) to access an index without reading the whole file into memory.

Daniel C. Sobral