I have a bunch of XML files that I'm trying to process in parallel. My Scala (2.9.2) code using futures starts out fine but ends up eating nearly 100% of the 32G of RAM on my machine. This doesn't happen when I process the files sequentially, so my guess is that something about using Scala futures is preventing garbage collection.
Here's a stripped down version of my code. Can anyone tell me what's wrong?
val filenameGroups = someStringListOfFilepaths.grouped(1000).toStream
val tasks = filenameGroups.map { fg =>
  scala.actors.Futures.future {
    val parser = new nu.xom.Builder() // I'm using nu.xom; not sure it matters.
    fg.map { path =>
      val doc = parser.build(new java.io.File(path))
      val result = doc.query(someXPathQuery) // placeholder for my actual XPath query
      result
    }.toList
  }
}
val pairs = tasks.par.flatMap(_.apply)
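For scale, the grouping step on its own does what you'd expect. This toy version (dummy filenames, no XML involved, names are mine) shows the shape of what each future receives:

```scala
// Toy version of the grouping step: dummy filenames instead of real
// paths, so it runs without any XML parsing.
object GroupingSketch {
  def main(args: Array[String]): Unit = {
    val someStringListOfFilepaths = (1 to 10).map("file" + _ + ".xml").toList
    val filenameGroups = someStringListOfFilepaths.grouped(4).toStream
    // Each future gets one group; the last group may be short.
    val sizes = filenameGroups.map(_.size).toList
    println(sizes) // List(4, 4, 2)
    // Note: a Stream memoizes every cell once forced, so every group stays
    // reachable for as long as filenameGroups itself is in scope.
  }
}
```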
ETA: OK, I solved this, but I still have no idea why it makes a difference.
I abstracted most of the inner-loop code into a separate method, then reran it, and pulled the parser instantiation out of the future. Memory usage now stays flat at a decent 17%. Does anybody have any idea why this would make a difference?
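To make concrete what changed, here's a dependency-free sketch of the two variants: a dummy parser stands in for nu.xom.Builder and plain thunks stand in for futures (all names here are mine). The observable difference is when the parser constructors run:

```scala
import java.util.concurrent.atomic.AtomicInteger

// Dependency-free sketch: DummyParser stands in for nu.xom.Builder,
// and plain thunks (() => ...) stand in for futures.
object CaptureSketch {
  val built = new AtomicInteger(0)
  class DummyParser {
    built.incrementAndGet() // count constructions
    def parse(s: String): String = s.toUpperCase
  }

  def main(args: Array[String]): Unit = {
    val groups = List(List("a"), List("b"), List("c"))
    // Variant 1 (original): parser constructed inside each deferred body,
    // i.e. later, on whatever thread runs the task.
    val inside = groups.map { g =>
      () => { val p = new DummyParser; g.map(p.parse) }
    }
    // Variant 2 (the fix): parser constructed eagerly, then captured.
    val outside = groups.map { g =>
      val p = new DummyParser
      () => g.map(p.parse)
    }
    println(built.get) // 3: only the eager variant has constructed parsers so far
    val results = (inside ++ outside).flatMap(_.apply())
    println(built.get) // 6: forcing the deferred bodies built the other three
  }
}
```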
Here's a simplified version of what I did:
def process(arglist...) = yada // signature elided; this is the old inner loop

val tasks = filenameGroups.map { fg =>
  val parser = new nu.xom.Builder()
  scala.actors.Futures.future {
    process(fg, parser)
  }
}
val pairs = tasks.par.flatMap(_.apply)
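In case it helps anyone comparing approaches: the same fan-out can be written against a plain java.util.concurrent fixed pool instead of scala.actors futures, which caps the worker-thread count explicitly. The process here is my own stand-in that just measures strings, so the example runs without nu.xom:

```scala
import java.util.concurrent.{Callable, Executors}
import scala.collection.JavaConverters._

object PoolSketch {
  // Stand-in for the real per-group work; it just measures each
  // "filename" so the example runs without nu.xom.
  def process(group: List[String]): List[Int] = group.map(_.length)

  def main(args: Array[String]): Unit = {
    val groups = List(List("a.xml", "bb.xml"), List("ccc.xml"))
    val pool = Executors.newFixedThreadPool(4) // explicit worker cap
    try {
      val callables = groups.map { g =>
        new Callable[List[Int]] { def call() = process(g) }
      }
      // invokeAll blocks until every task has finished.
      val results = pool.invokeAll(callables.asJava).asScala.flatMap(_.get).toList
      println(results) // List(5, 6, 7)
    } finally pool.shutdown()
  }
}
```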