Below code runs a comparison of users and writes to file. I've removed some code to make it as concise as possible but speed is an issue also in this code :
import scala.collection.JavaConversions._
object writedata {
def getDistance(str1: String, str2: String) = {
val zipped = str1.zip(str2)
val numberOfEqualSequences = zipped.count(_ == ('1', '1')) * 2
val p = zipped.count(_ == ('1', '1')).toFloat * 2
val q = zipped.count(_ == ('1', '0')).toFloat * 2
val r = zipped.count(_ == ('0', '1')).toFloat * 2
val s = zipped.count(_ == ('0', '0')).toFloat * 2
(q + r) / (p + q + r)
} //> getDistance: (str1: String, str2: String)Float
case class UserObj(id: String, nCoordinate: String)
val userList = new java.util.ArrayList[UserObj] //> userList : java.util.ArrayList[writedata.UserObj] = []
for (a <- 1 to 100) {
userList.add(new UserObj("2", "101010"))
}
def using[A <: { def close(): Unit }, B](param: A)(f: A => B): B =
try { f(param) } finally { param.close() } //> using: [A <: AnyRef{def close(): Unit}, B](param: A)(f: A => B)B
def appendToFile(fileName: String, textData: String) =
using(new java.io.FileWriter(fileName, true)) {
fileWriter =>
using(new java.io.PrintWriter(fileWriter)) {
printWriter => printWriter.println(textData)
}
} //> appendToFile: (fileName: String, textData: String)Unit
var counter = 0; //> counter : Int = 0
for (xUser <- userList.par) {
userList.par.map(yUser => {
if (!xUser.id.isEmpty && !yUser.id.isEmpty)
synchronized {
appendToFile("c:\\data-files\\test.txt", getDistance(xUser.nCoordinate , yUser.nCoordinate).toString)
}
})
}
}
The above code was previously an imperative solution, so the .par functionality was within an inner and outer loop. I'm attempting to convert it to a more functional implementation while also taking advantage of Scala's parallel collections framework.
In this example the data set size is 10 but in the code im working on the size is 8000 which translates to 64'000'000 comparisons. I'm using a synchronized block so that multiple threads are not writing to same file at same time. A performance improvment im considering is populating a separate collection within the inner loop ( userList.par.map(yUser => {) and then writing that collection out to seperate file.
Are there other methods I can use to improve performance. So that I can handle a List that contains 8000 items instead of above example of 100 ?