I have a huge file (it does not fit into memory) which is tab separated with two columns (key and value), and pre-sorted on the key column. I need to call a function on all values for a key and write out the result. For simplicity, one can assume that the values are numbers and the function is addition.
So, given an input:
A 1
A 2
B 1
B 3
The output would be:
A 3
B 4
For this question, I'm not so much interested in reading/writing the file as in the list comprehension side. It is important, though, that the whole content (input as well as output) doesn't fit into memory. I'm new to Scala, and coming from Java I'd like to know what the functional/Scala way to do this would be.
Update:
Based on AmigoNico's comment, I came up with the constant-memory solution below. Any comments / improvements are appreciated!
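// Print one aggregated (key, sum) pair.
val writeAggr = (kv: (String, Int)) => println(kv._1 + " " + kv._2)

// Fold over the lines, carrying only the current key and its running sum.
// The final writeAggr call emits the last group once the input is exhausted.
writeAggr(
  scala.io.Source.fromFile("/tmp/xx").getLines().foldLeft(("", 0)) { (keyAggr, line) =>
    val Array(k, v) = line.split('\t') // the file is tab separated
    if (keyAggr._1 == k) {
      // Same key as before: keep accumulating.
      (k, keyAggr._2 + v.toInt)
    } else {
      // The key changed: emit the finished group (skipping the empty seed).
      if (keyAggr._1.nonEmpty) writeAggr(keyAggr)
      (k, v.toInt)
    }
  }
)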
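One possible variation on the fold above, as a rough sketch rather than a definitive answer: separate the grouping from the printing by wrapping the input in a lazy iterator that yields one result per run of equal keys. The helper name `aggregateSorted` and its signature are made up for illustration, and the in-memory `lines` iterator stands in for `Source.fromFile(...).getLines()`. It relies on the input being sorted by key and only ever holds the current key and its accumulator.

```scala
// Hypothetical helper: lazily folds each run of equal keys in a key-sorted iterator.
def aggregateSorted[K, V, R](pairs: Iterator[(K, V)])(zero: R)(op: (R, V) => R): Iterator[(K, R)] = {
  val in = pairs.buffered // lets us peek at the next key without consuming it
  new Iterator[(K, R)] {
    def hasNext: Boolean = in.hasNext
    def next(): (K, R) = {
      val key = in.head._1
      var acc = zero
      // Consume exactly the entries that share the current key.
      while (in.hasNext && in.head._1 == key) acc = op(acc, in.next()._2)
      (key, acc)
    }
  }
}

// Stand-in for the real file; replace with Source.fromFile(...).getLines().
val lines = Iterator("A\t1", "A\t2", "B\t1", "B\t3")
val pairs = lines.map { line =>
  val Array(k, v) = line.split('\t')
  (k, v.toInt)
}

// Nothing is read until the result iterator is consumed.
aggregateSorted(pairs)(0)(_ + _).foreach { case (k, sum) => println(s"$k\t$sum") }
```

With the sample input from the question this prints the two sums, 3 for A and 4 for B. Because the result is itself an `Iterator`, the aggregation function can be swapped for something other than addition without touching the grouping logic.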