
I have a fairly big file (20-30 MB). I have a map whose keys are names and whose values are regexes. For each entry I need to grep the file with the regex to get the actual value for that key, and store the new key/value pair in a second map. Here is my approach:

contextmap // initial map containing each key and its value in the form of a regex
contextstrings // final map that should hold the grepped values
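For concreteness, these could look like this (the exact types, the sample regex, and the path f are assumptions based on the description, not the original code):

import scala.collection.mutable

val contextmap: Map[String, String] = Map("key1" -> """key1=(\w+)""") // hypothetical key -> regex with one capture group
val contextstrings = mutable.Map.empty[String, String]                // key -> extracted value
val f = "/path/to/bigfile.txt"                                        // hypothetical path to the 20-30 MB file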

import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
import scala.io.Source
import scala.util.matching.Regex
import scala.util.{Failure, Success}

def fgrepFuture(e: (String, String)) = Future {
  val re = new Regex(e._2)
  // Return the first match in the file as (key, capturedValue), or ("", "") if none
  Source.fromFile(f).getLines()
    .map(re.findFirstMatchIn(_))
    .collectFirst { case Some(m) => (e._1, m.group(1)) }
    .getOrElse(("", ""))
}
val fg = Future.traverse(contextmap.toList)(fgrepFuture)
fg onComplete {
  case Success(pairs) => for ((k, v) <- pairs) contextstrings += (k -> v)
  case Failure(ex)    => println(s"grep failed: $ex") // or proper error handling
}

The problem is that by the time the future completes, the rest of my code (built on the asynchronous model of Akka actors) has already moved on too far, so I don't have the grepped values from the file in time (I want them to be globally available). I expected this approach to be fast since multiple futures run in parallel, but it isn't, so please point out the flaw. Also, if there is a better approach to grep multiple values from a fairly large file, please suggest that as well.

Harsh Gupta

2 Answers


You can identify the furthest point in your program where, if it is reached and the Future is not yet complete, you need to Await; the limited benefit is that you can do some other work in the meantime. Something else you can do is grep in a parallel manner, like this:

import scala.io.Source

val chunkSize = 128 * 1024
// grouped(n) batches the iterator into chunks of n lines;
// each chunk's lines are then processed in parallel via .par
val iterator = Source.fromFile(path).getLines().grouped(chunkSize)
iterator.foreach { lines =>
  lines.par.foreach { line => process(line) }
}

based on this post.
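For the Await part, a minimal sketch (fg is the future from the question; the 10-second timeout is just a placeholder, pick one that fits your latency budget):

import scala.concurrent.Await
import scala.concurrent.duration._

// Block at the last point where the grepped values are actually required.
// Await.result rethrows if the future failed or the timeout expires.
val pairs = Await.result(fg, 10.seconds)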

Ion Cojocaru

You might be doing parallel work, but all of your parallel tasks are reading the same file, f. That is going to be extremely slow... even slower than reading the file just once.

IO is the bottleneck here, and there is nothing parallelism can do about that.

You can either:

1) Just do one pass over the file and grab all keys in that single pass.

2) Load the file into memory and then have the parallel tasks work on that read-only data structure.

Option 2) would be useful if each task did a lot of work, but since you are just grepping, I would go with option 1); a sketch follows below.
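A minimal sketch of that single pass, assuming the contextmap and file handle f from the question and one capture group per pattern:

import scala.io.Source
import scala.util.matching.Regex

// Compile every regex once, then stream the file a single time,
// dropping each key as soon as its first match is found.
val compiled: Map[String, Regex] = contextmap.map { case (k, v) => k -> new Regex(v) }
var remaining = compiled
val found = scala.collection.mutable.Map.empty[String, String]

val lines = Source.fromFile(f).getLines()
while (lines.hasNext && remaining.nonEmpty) {
  val line = lines.next()
  remaining = remaining.filter { case (key, re) =>
    re.findFirstMatchIn(line) match {
      case Some(m) => found += key -> m.group(1); false // key resolved, stop tracking it
      case None    => true                              // keep scanning for this key
    }
  }
}

For option 2), you could instead load everything once with Source.fromFile(f).getLines().toVector and let the parallel tasks share that read-only vector.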

toto2