4

In an attempt to rewrite PHP's similar_text algorithm I have tried a few different approaches. All have been moderately successful but ultimately failed.

First attempt: I tried just rewriting it from the PHP source code. C's elegant use of pointers makes the same exact implementation seemingly impossible to make in Scala and be clean.

Second attempt: I tried rewriting it from a Java function someone posted on PHP similar_text() in java. Unfortunately that function doesn't work in Java so nevermind porting it over to Scala.

Third (current) attempt: I'm currently attempting to translate this JavaScript implementation into Scala: http://phpjs.org/functions/similar_text/. I've used it before in JavaScript and it seems to function properly. My translation (below) into Scala is not functioning properly. It gets you within 1 or 2 similarity indexes but it is typically not 100% to the results of it's PHP counterpart.

def similartext(first:String,second:String) : Int = {
  if (first == null || second == null) {
    0
  }

  var pos1:Int = 0
  var pos2:Int = 0
  var max:Int = 0
  var sum:Int = 0
  var l:Int = 0

  val firstLength:Int = first.length
  val secondLength:Int = second.length

  for (p <- 0 until firstLength) {
    for (q <- 0 until secondLength) {
      while(p+l < firstLength && q+l < secondLength && (first.charAt(p+l) == second.charAt(q+l))) {
        if (l > max) {
            println("[" + p + "," + q + "," + l + "]" + first.charAt(p+l) + " | " + second.charAt(q+l))
            max = l
            pos1 = p
            pos2 = q
          }
        l += 1
      }
    }
  }

  sum = max;

  if (sum > 0) {
    if (pos1 > 0 && pos2 > 0) {
      sum += similartext(first.substring(0, pos2), second.substring(0, pos2))
    }

    if ((pos1 + max < firstLength) && (pos2 + max < secondLength)) {
      sum += similartext(first.substring(pos1 + max, (pos1 + max) + (firstLength - pos1 - max)), second.substring(pos2 + max, (pos2 + max) + (secondLength - pos2 - max)))
    }
  }

  sum;
}

Tests:

(Scala)val st = similartext("apple","aple") Yields 3
(PHP)$similar = similar_text("apple","aple"); Yields 4

(Scala)val st = similartext("starbucks","stharducks") Yields 8
(PHP)$similar = similar_text("starbucks","stharducks"); Yields 8

(Scala)val st = similartext("hello earth!","hello world!") Yields 10
(PHP)$similar = similar_text("hello earth!","hello world!"); Yields 8

Does anyone have any ideas on what is going wrong here?

Community
  • 1
  • 1
Commander
  • 1,322
  • 2
  • 13
  • 29

1 Answers1

7

Here's a hint: look very closely at line 28 of the JavaScript version—particularly the last character of the line. That's where your implementation differs. (You also don't reset l to zero for every pair of indices, but that's not the most important problem.)

Here's a var-free Scala version, by the way:

def similarText(x: String, y: String): Int = {
  val indices = for {
    (s, p) <- x.tails.zipWithIndex
    (t, q) <- y.tails.zipWithIndex
    l = ((s zip t) takeWhile Function.tupled(_ == _)).size
  } yield (p, q, l)
  val (pos1, pos2, max) = indices.maxBy(_._3)

  if (max == 0) max else max +
    similarText(x take pos1, y take pos2) +
    similarText(x drop (pos1 + max), y drop (pos2 + max))
}

This is fairly off-the-cuff—I'm sure you could make it more concise and efficient pretty easily.

And for extra credit: there's a bug in the JavaScript version—try for example "aabcd" and "abcabcd" and the result won't be the same as PHP's (or mine).

Travis Brown
  • 138,631
  • 12
  • 375
  • 680
  • Thank you so much. Not only does this point out an (two) obvious flaw in the approach but it also gives me some great code to analyze and improve at Scala. I just started with Scala about two days ago so this is a huge help. p.s. when I get around to it I'm going to take a look at the JS code again and figure out that bug you're speaking about. I believe it has to do with skipping the first character comparison but that's just off the top of my head without reviewing it again. – Commander Nov 11 '12 at 01:19
  • The JavaScript bug is actually a lot more boring than that! I'm hesitant to let you waste any time looking for it, since it's just a typo, so let me know if you want me to add a more direct pointer to the answer. – Travis Brown Nov 11 '12 at 01:30