0

It's a function to find the third largest of a collection of integers. I'm calling it like this:

val lineStream = thirdLargest(Source.fromFile("10m.txt").getLines.toIterable
val intStream = lineStream map { s => Integer.parseInt(s) }
thirdLargest(intStream) 

The file 10m.txt contains 10 million lines with a random integer on each one. The thirdLargest function below should not be keeping any of the integers after it has tested them, and yet it causes the JVM to run out of memory (after about 90 seconds in my case).

def thirdLargest(numbers: Iterable[Int]): Option[Int] = {
    def top3of4(top3: List[Int], fourth: Int) = top3 match {
        case List(a, b, c) =>
            if (fourth > c) List(b, c, fourth)
            else if (fourth > b) List(b, fourth, c)
            else if (fourth > a) List(fourth, b, c)
            else top3
    }

    @tailrec
    def find(top3: List[Int], rest: Iterable[Int]): Int = (top3, rest) match {
        case (List(a, b, c), Nil) => a
        case (top3, d #:: rest) => find(top3of4(top3, d), rest)
    }

    numbers match {
        case a #:: b #:: c #:: rest => Some(find(List[Int](a, b, c).sorted, rest))
        case _ => None
    }
}
Isvara
  • 3,403
  • 1
  • 28
  • 42
  • 1
    There is a `BufferedSource` which seems to be better fit to read from a file (it also provides an iterator to go over lines sequentially) – Ashalynd Oct 31 '13 at 14:06
  • fromFile already returns a BufferedSource. – Isvara Oct 31 '13 at 17:53
  • I am using `source.getLines().buffered` and the `BufferedIterator` returned seems to work as expected (as in not blowing up with an OutOfMemoryException on super large files). Here's where I created an answer based on this exact strategy to handle very large text files: http://stackoverflow.com/a/34283624/501113 – chaotic3quilibrium Dec 15 '15 at 08:06

1 Answers1

2

The OOM error has nothing to do with the way you read the file. It is totally fine and even recommended to use Source.getLines here. The problem is elsewhere.

Many people are being confused by the nature of Scala Stream concept. In fact this is not something you would want to use just to iterate over things. It is lazy indeed however it doesn't discard previous results – they're being memoized so there's no need to recalculate them again on the next use (which never happens in your case but that's where your memory goes). See also this answer.

Consider using foldLeft. Here's a working (but intentionally simplified) example for illustration purposes:

val lines = Source.fromFile("10m.txt").getLines()

print(lines.map(_.toInt).foldLeft(-1 :: -1 :: -1 :: Nil) { (best3, next) =>
  (next :: best3).sorted.reverse.take(3)
})
Community
  • 1
  • 1
coffeesnake
  • 1,193
  • 9
  • 8