0

Please advice on algorithm and implementation to compare elements in a very long list in Scala. I have a list with thousands of strings (from SQL) and I need to compare each list element with all other elements in this list.

As a result I need to get a list of tuples: List[(String, String, Boolean)] where first two elements are strings to match and third is a result.

For a list of N elements my algorithm so far is as follows:

  1. Take head of the list
  2. Compare head with remaining N-1 elements in the list
  3. Make new list from a tail of the old list and do all above work with this new list of N -1 elements:

Code:

   /**
   * Compare head of the list with each remaining element in this list
   */
  def cmpel(
    fst: String, lst: List[String],
    result: List[(String, String, Boolean)]): List[(String, String, Boolean)] = {

    lst match {
      case next :: tail => cmpel(fst, tail, (fst, next, fst == next) :: result)
      case nill => result.reverse
    }
  }

  /**
   * Compare list elements in all combinations of two
   */
  def cmpAll(lst: List[String],
    result: List[(String, String, Boolean)]): List[(String, String, Boolean)] = {
    lst match {
      case head :: tail => cmpAll(tail, result ++ cmpel(head, tail, List()))
      case nill => result
    }
  }

  def main(args: Array[String]): Unit = {
    val lst = List[String]("a", "b", "b", "a")
    println(cmpAll(lst, List()))
  }

Result:

 List((a,b,false), (a,b,false), (a,a,true), (b,b,true), (b,a,false), (b,a,false))

Thanks!

dhg
  • 52,383
  • 8
  • 123
  • 144
DarqMoth
  • 603
  • 1
  • 13
  • 31
  • WHat are you then doing with the results of all this? It looks rather like something you should be doing in the DB, not in scala code – The Archetypal Paul Jul 06 '14 at 14:21
  • 1
    And is the order important? Otherwise, you could just sort them (or get them in sorted order from SQL). If necessary, have another column with the original order. Your comparison can only be equal if you have duplicates, so you can do that in one scan through the list Keep a list for each entry of other elements it is equal to. Re-sort into the original order if necessary. – The Archetypal Paul Jul 06 '14 at 14:24
  • In general case I need to do approximate match of every two strings. So sort will not help, I think – DarqMoth Jul 06 '14 at 14:36
  • OK. And what then? What do you do with the approximately-matched pairs? How many will match? do you need all pairs in the result list or can you assemble only the matched (or unmatched) and assume any pairs not present must have the other result? What's the approximate-match criteria? (you have an answer to your specific question, I'm just curious to know if it's the right question :)) – The Archetypal Paul Jul 06 '14 at 14:40
  • Please see my next, related question: http://stackoverflow.com/questions/24599139/scala-find-edit-distance-for-all-elements-of-a-list-not-fitting-in-memory – DarqMoth Jul 06 '14 at 18:47

1 Answers1

8

You can use the tails and flatMap methods to write a more concise and idiomatic solution:

list.tails.flatMap {
  case x :: rest => rest.map { y =>
    (x, y, x == y)
  }
  case _ => List()
}.toList

The tails method returns an iterator that iterates over repeated applications of .tail to the list. The first element in the iterator is the list itself, then the tail of the list, and so on, finally returning the empty list.

wingedsubmariner
  • 13,350
  • 1
  • 27
  • 52
  • 1
    Thanks! Will `tails` allocate memory for all list tails at once or lazily? I have a huge input list, so this may be an issue. – DarqMoth Jul 06 '14 at 13:53
  • 1
    A quick test in the REPL or look at the API doc shows that the return type of tails is Iterator, so that indicates lazy. – Seth Tisue Jul 06 '14 at 14:27
  • @DarqMoth `tails` is lazy, but even if it wasn't structural sharing between the lists (thanks to immutability) means that the memory consumption would only be O(n), not the O(n^2) it would be if each element was a separate list. – wingedsubmariner Jul 06 '14 at 16:36
  • Please see my next, related question: http://stackoverflow.com/questions/24599139/scala-find-edit-distance-for-all-elements-of-a-list-not-fitting-in-memory – DarqMoth Jul 06 '14 at 18:48
  • Worth noting that the input `list.` needs to be a `List`. If it's only typed as a `Seq` the first `case =>` won't match. To avoid that you can simply use `.toList`: `mySeq.toList.tails.flatMap` – RedYeti Dec 14 '21 at 13:40