
I have a fairly big file (20-30 MB). I have a map whose keys are names and whose values are regexes. For each entry I need to grep the file with the regex to get the actual value for that key, and store the new key/value pair in a second map. Here is my approach:

contextmap // initial map containing each key and its value in the form of a regex
contextstrings // final map that should hold the grepped values
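For concreteness, these could look like this (the exact types, the sample regex, and the path f are assumptions based on the description, not the original code):

import scala.collection.mutable

val contextmap: Map[String, String] = Map("key1" -> """key1=(\w+)""") // hypothetical key -> regex with one capture group
val contextstrings = mutable.Map.empty[String, String]                // key -> extracted value
val f = "/path/to/bigfile.txt"                                        // hypothetical path to the 20-30 MB file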

import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
import scala.io.Source
import scala.util.matching.Regex
import scala.util.{Failure, Success}

def fgrepFuture(e: (String, String)) = Future {
  val re = new Regex(e._2)
  // Return the first match in the file as (key, capturedValue), or ("", "") if none
  Source.fromFile(f).getLines()
    .map(re.findFirstMatchIn(_))
    .collectFirst { case Some(m) => (e._1, m.group(1)) }
    .getOrElse(("", ""))
}
val fg = Future.traverse(contextmap.toList)(fgrepFuture)
fg onComplete {
  case Success(pairs) => for ((k, v) <- pairs) contextstrings += (k -> v)
  case Failure(ex)    => println(s"grep failed: $ex") // or proper error handling
}

The problem is that by the time the future completes, the rest of my code (built on the asynchronous model of Akka actors) has already moved on too far, so I don't have the grepped values from the file in time (I want them to be globally available). I expected this approach to be fast since multiple futures run in parallel, but it isn't, so please point out the flaw. Also, if there is a better approach to grep multiple values from a fairly large file, please suggest that as well.

Harsh Gupta

2 Answers


You can identify the furthest point in your program where, if it is reached and the Future is not yet complete, you need to Await; the limited benefit is that you can do some other work in the meantime. Something else you can do is grep in a parallel manner, like this:

import scala.io.Source

val chunkSize = 128 * 1024
// grouped(n) batches the iterator into chunks of n lines;
// each chunk's lines are then processed in parallel via .par
val iterator = Source.fromFile(path).getLines().grouped(chunkSize)
iterator.foreach { lines =>
  lines.par.foreach { line => process(line) }
}

based on this post.
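For the Await part, a minimal sketch (fg is the future from the question; the 10-second timeout is just a placeholder, pick one that fits your latency budget):

import scala.concurrent.Await
import scala.concurrent.duration._

// Block at the last point where the grepped values are actually required.
// Await.result rethrows if the future failed or the timeout expires.
val pairs = Await.result(fg, 10.seconds)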

Ion Cojocaru

You might be doing parallel work, but all of your parallel tasks are reading the same file, f. That is going to be extremely slow... even slower than reading the file just once.

IO is the bottleneck here, and there is nothing parallelism can do about that.

You can either:

1) Just do one pass over the file and grab all keys in that single pass.

2) Load the file into memory and then have the parallel tasks work on that read-only data structure.

Option 2) would be useful if each task did a lot of work, but since you are just grepping, I would go with option 1); a sketch follows below.
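A minimal sketch of that single pass, assuming the contextmap and file handle f from the question and one capture group per pattern:

import scala.io.Source
import scala.util.matching.Regex

// Compile every regex once, then stream the file a single time,
// dropping each key as soon as its first match is found.
val compiled: Map[String, Regex] = contextmap.map { case (k, v) => k -> new Regex(v) }
var remaining = compiled
val found = scala.collection.mutable.Map.empty[String, String]

val lines = Source.fromFile(f).getLines()
while (lines.hasNext && remaining.nonEmpty) {
  val line = lines.next()
  remaining = remaining.filter { case (key, re) =>
    re.findFirstMatchIn(line) match {
      case Some(m) => found += key -> m.group(1); false // key resolved, stop tracking it
      case None    => true                              // keep scanning for this key
    }
  }
}

For option 2), you could instead load everything once with Source.fromFile(f).getLines().toVector and let the parallel tasks share that read-only vector.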

toto2