Using par map to increase performance

Question

The code below compares every pair of users and writes each result to a file. I've removed some code to keep it as concise as possible, but speed is still an issue in this version:

import scala.collection.JavaConversions._

object writedata {

  def getDistance(str1: String, str2: String) = {

    val zipped = str1.zip(str2)
    val numberOfEqualSequences = zipped.count(_ == ('1', '1')) * 2

    val p = zipped.count(_ == ('1', '1')).toFloat * 2
    val q = zipped.count(_ == ('1', '0')).toFloat * 2
    val r = zipped.count(_ == ('0', '1')).toFloat * 2
    val s = zipped.count(_ == ('0', '0')).toFloat * 2

    (q + r) / (p + q + r)

  }                                               //> getDistance: (str1: String, str2: String)Float

  case class UserObj(id: String, nCoordinate: String)
  val userList = new java.util.ArrayList[UserObj] //> userList  : java.util.ArrayList[writedata.UserObj] = []
  for (a <- 1 to 100) {
    userList.add(new UserObj("2", "101010"))
  }
  def using[A <: { def close(): Unit }, B](param: A)(f: A => B): B =
    try { f(param) } finally { param.close() }    //> using: [A <: AnyRef{def close(): Unit}, B](param: A)(f: A => B)B

  def appendToFile(fileName: String, textData: String) =
    using(new java.io.FileWriter(fileName, true)) {
      fileWriter =>
        using(new java.io.PrintWriter(fileWriter)) {
          printWriter => printWriter.println(textData)
        }
    }                                             //> appendToFile: (fileName: String, textData: String)Unit

  var counter = 0;                                //> counter  : Int = 0

  for (xUser <- userList.par) {
    userList.par.map(yUser => {
      if (!xUser.id.isEmpty && !yUser.id.isEmpty)
        synchronized {
          appendToFile("c:\\data-files\\test.txt", getDistance(xUser.nCoordinate , yUser.nCoordinate).toString)
        }
    })
  }

}

The above code was previously an imperative solution, so the .par functionality was within an inner and outer loop. I'm attempting to convert it to a more functional implementation while also taking advantage of Scala's parallel collections framework.

In this example the data set size is 100, but in the code I'm working on the size is 8000, which translates to 64'000'000 comparisons. I'm using a synchronized block so that multiple threads do not write to the same file at the same time. A performance improvement I'm considering is populating a separate collection within the inner loop ( userList.par.map(yUser => {) and then writing that collection out to a separate file.

Are there other methods I can use to improve performance, so that I can handle a list of 8000 items instead of the 100 in the example above?

1. Looks like your 'appendToFile' method opens and closes the FileWriter every time you write to the file. Try to open the file at the beginning of processing and close it at the end. Additionally, try not to block the threads of the 'par' processing. Use an ArrayBlockingQueue and write to the file from another execution context. Use AtomicInteger or AtomicLong for the counter - Yuriy Shinbuev, Apr 10 '14 at 22:56
I'd also suggest using a StringBuffer (as it's thread-safe) and periodically flushing it to the PrintWriter - Yuriy Shinbuev, Apr 10 '14 at 23:06

score 1 · Answer 1 · edited May 23 '17 at 12:05

I'm not sure if you removed too much code for clarity, but from what I can see, there is absolutely nothing that can run in parallel since the only thing you are doing is writing to a file.

EDIT:

One thing that you should do is to move the getDistance(...) computation before the synchronized call to appendToFile, otherwise your parallelized code ends up being sequential.
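As a rough sketch of that reordering: the distance is computed outside the lock and only the println call is serialized. The object name DistanceOutsideLock, the writeAll helper and the lock object are made up for illustration. Future is used in place of .par only so the sketch runs without the separate parallel-collections module that newer Scala versions need; the same ordering applies inside userList.par.map. The writer is also opened once by the caller instead of once per line, as the comments above suggest.

```scala
import java.io.PrintWriter
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object DistanceOutsideLock {
  case class UserObj(id: String, nCoordinate: String)

  // Same ratio as the question's getDistance: (q + r) / (p + q + r).
  // The "* 2" factors from the question cancel out, so they are omitted here.
  def getDistance(a: String, b: String): Float = {
    val zipped = a.zip(b)
    val p = zipped.count(_ == ('1', '1')).toFloat
    val q = zipped.count(_ == ('1', '0')).toFloat
    val r = zipped.count(_ == ('0', '1')).toFloat
    (q + r) / (p + q + r)
  }

  // Hypothetical helper: one task per outer user, all sharing one open writer.
  def writeAll(users: Seq[UserObj], out: PrintWriter): Unit = {
    val lock = new Object
    val tasks = for (x <- users) yield Future {
      for (y <- users if x.id.nonEmpty && y.id.nonEmpty) {
        // Computed outside the lock, so different tasks can overlap here.
        val line = getDistance(x.nCoordinate, y.nCoordinate).toString
        // Only the actual write is serialized.
        lock.synchronized { out.println(line) }
      }
    }
    Await.result(Future.sequence(tasks), 1.minute)
    out.flush()
  }
}
```

The caller would create the PrintWriter once (for example around a FileWriter), pass it to writeAll, and close it afterwards.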

Instead of calling a synchronized appendToFile, I would call appendToFile in a non-synchronized way, but have each call to that method add the new line to some synchronized queue. Then I would have another thread that flushes that queue to disk periodically. But then you would also need to add something to make sure that the queue is also flushed when all computations are done. So that could get complicated...
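A minimal sketch of the queue idea, under a few assumptions: a LinkedBlockingQueue plays the role of the synchronized queue, a single writer thread drains it continuously rather than periodically, and a sentinel value (Done, a made-up name) tells the writer that all computations have finished, which covers the final flush. The QueuedWriter object and its writeAll method are hypothetical, and Future stands in for .par so the sketch has no extra dependencies.

```scala
import java.io.PrintWriter
import java.util.concurrent.LinkedBlockingQueue
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object QueuedWriter {
  // Hypothetical end-of-input marker. It is compared by reference (ne),
  // so no computed line can be mistaken for it.
  private val Done = new String("<done>")

  def writeAll[A](inputs: Seq[A], out: PrintWriter)(compute: A => String): Unit = {
    val queue = new LinkedBlockingQueue[String]()

    // The only thread that touches the writer, so no lock is needed around println.
    val writer = new Thread(new Runnable {
      def run(): Unit = {
        var line = queue.take()
        while (line ne Done) {
          out.println(line)
          line = queue.take()
        }
        out.flush()
      }
    })
    writer.start()

    try {
      // Producers run concurrently and only enqueue their results.
      val producers = inputs.map(i => Future { queue.put(compute(i)) })
      Await.result(Future.sequence(producers), 1.minute)
    } finally {
      // Signal completion even if a producer failed, then wait for the drain.
      queue.put(Done)
      writer.join()
    }
  }
}
```

Lines reach the file in completion order rather than input order. If the order matters, each line would need to carry its own index.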

Alternatively, you could also keep your code and simply drop the synchronization around the call to appendToFile. It seems that println itself is synchronized. However, that would be risky since println is not officially synchronized and it could change in future versions.
