
I really like the

for (line <- Source.fromFile(inputPath).getLines()) { doSomething(line) }

construction for iterating over a file in Scala, and I am wondering whether there is a way to use a similar construction to iterate over the lines of all the files in a directory.

An important restriction here is that the files together are far too large to fit on the heap (think dozens of GB, so increasing the heap size isn't an option). As a workaround for the time being, I have been cat'ing everything together into one file and using the above construction, which works because of laziness.

Point being, this seems to raise questions like: can I concatenate two (hundred) lazy iterators and get a really big, really lazy one?

chuck taylor

1 Answer


Yes, although it's not quite so concise:

import java.io.File
import scala.io.Source

for {
  file <- new File(dir).listFiles.iterator if file.isFile
  line <- Source.fromFile(file).getLines()
} { doSomething(line) }

The trick is flatMap and its for-comprehension syntactic sugar. The above, for example, is more or less equivalent to the following:

new File(dir)
  .listFiles.iterator
  .filter(_.isFile)
  .flatMap(file => Source.fromFile(file).getLines())
  .foreach(doSomething) // foreach, not map: map on an Iterator is lazy and would run nothing
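To see that the flattened iterator stays lazy, here is a small sketch that swaps the files for in-memory iterators. The names `LazyLinesSketch`, `fakeLines` and `allLines` are made up for illustration, and `fakeLines` merely stands in for `Source.fromFile(file).getLines()`. It records when each "file" gets opened, which with the standard library's `Iterator.flatMap` should only happen once the iteration actually reaches it.

```scala
import scala.collection.mutable.ListBuffer

object LazyLinesSketch {
  // Hypothetical stand-in for Source.fromFile(file).getLines():
  // records the name in `opened` at the moment the "file" is opened.
  def fakeLines(name: String, opened: ListBuffer[String]): Iterator[String] = {
    opened += name
    Iterator(s"$name:1", s"$name:2")
  }

  // Flattens the per-file iterators into a single lazy iterator of lines.
  def allLines(names: Seq[String], opened: ListBuffer[String]): Iterator[String] =
    names.iterator.flatMap(name => fakeLines(name, opened))

  def main(args: Array[String]): Unit = {
    val opened = ListBuffer.empty[String]
    val lines = allLines(Seq("a", "b", "c"), opened)
    println(opened.toList) // building the iterator opens nothing
    println(lines.next())  // pulling one line opens only the first "file"
    println(opened.toList)
    lines.foreach(println) // draining the rest opens the others in turn
    println(opened.toList)
  }
}
```

Only one inner iterator is live at a time, which is why memory use stays flat no matter how many files are chained.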

As Daniel Sobral notes in a comment below, this approach (and the code in your question) will leave files open. If this is a one-off script or you're just working in the REPL, this might not be a big deal. If you do run into problems, you can use the pimp-my-library pattern to implement some basic resource management:

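// Define this inside an object or in the REPL (Scala 2.10+).
implicit class ClosingSource(source: Source) {
  // Like getLines, but closes the source as soon as hasNext finds no more
  // lines, which also covers empty files.
  def getLinesAndClose: Iterator[String] = new Iterator[String] {
    private val lines = source.getLines()
    private var stillOpen = true
    def hasNext: Boolean = {
      if (stillOpen && !lines.hasNext) { source.close(); stillOpen = false }
      stillOpen
    }
    def next(): String =
      if (hasNext) lines.next() else throw new NoSuchElementException("no more lines")
  }
}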

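Now just use Source.fromFile(file).getLinesAndClose in place of getLines. Each file is closed once its lines have been read to the end, so as long as you consume the whole iterator you won't have to worry about files being left open.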
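If you are on Scala 2.13 or later and don't need a single iterator spanning all files, a different technique is to let `scala.util.Using` close each file for you, even when the line handler throws. This is only a sketch of that alternative, not the approach above: `ForEachLineSketch` and `foreachLine` are hypothetical names, and the callback style is an assumption about how you want to consume the lines.

```scala
import java.io.File
import scala.io.Source
import scala.util.Using

object ForEachLineSketch {
  // Hypothetical helper: applies `f` to every line of every regular file in
  // `dir`, one file at a time, closing each Source before opening the next.
  def foreachLine(dir: File)(f: String => Unit): Unit = {
    // listFiles returns null when `dir` is not a readable directory.
    val files = Option(dir.listFiles).getOrElse(Array.empty[File])
    for (file <- files.iterator if file.isFile)
      Using.resource(Source.fromFile(file)) { source =>
        source.getLines().foreach(f)
      }
  }
}
```

Usage would look like `ForEachLineSketch.foreachLine(new File(dir))(doSomething)`. The trade-off is that you give up the composable iterator in exchange for the guarantee that at most one file is open at any moment.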

Travis Brown
  • That's perfect, I just ran over around 10gb of files using the scala repl with a code bit based on that and the memory usage barely budged. Thanks much! – chuck taylor Apr 10 '12 at 23:19
  • Note, however, that the `Source` for each file is not being closed. In this particular case, where code might touch hundreds of files, using some sort of ARM is important. – Daniel C. Sobral Apr 11 '12 at 01:09