Finding a Motif in DNA

Question

Problem can be found here: http://rosalind.info/problems/subs/

The question I have has to do with the performance of the two solutions provided below.

1.

  def indexOfAppearances(strand: String, subStrand: String): List[Int] = {
    def help0(x: String, y: String, accu: List[Int]): List[Int] =
      (x contains y) match {
        case true => {
          val index = (x indexOfSlice y) /* index where substring appears */
          val adjust = strand.length - x.length
          /* adjustment since the x has all the previous
               * nucleotides removed.
               */

          val newX = x.drop(index + 1).mkString
           /* here's the drop of index + 1 elements */
          help0(newX, y, (index + adjust) :: accu) /* tail recursive call */
        }
        case false => accu
      }

     help0(strand, subStrand, List()).reverse.toList.map(x => x + 1)
         //indexes are not from 0 but from 1
  }

2.

  val s = "ACGTACGTACGTACGT"
  val t = "GTA"

  val combs = s.sliding(t.length).zipWithIndex
  val locations = combs.collect { case (sub, i) if (sub == t) => i + 1 }
  println(locations.mkString(" "))

The second solution is pretty, functional and short.

First solution is a bit large but it's still functional. I could have left out the val's and just work with the values to make it shorter but that wasn't my goal.

After seeing the 2nd solution I was quite disappointed by mine due to the length of code. Checked the scala library to see why the 2nd solution works and then re-implemented it myself. Thought about checking the performance of both solutions and made a huge 30 million DNA strand.

Surprised!

Performance:

First number is the DNA length, and the next two numbers present the execution time of 1st and 2nd solution(in ms).

11,226,096 - 4921 - 14503

33,678,288 - 6448 - 35150

Why is the performance so much different?

I've tried checking the scala library but couldn't find a thing which explains this behaviour.

I assumed that the first solution is creating many objects thus consuming more memory and taking a lot of time to do that but it seems that for some reason it works much faster. I doubt it's the tail recursion and I doubt that zipWithIndex takes a lot of time. Iterators are just iterators?

Thanks!

By the way, what is the point of using pattern matching against true/false? why not just `if`? — om-nom-nom, Nov 08 '12 at 22:16
I used the if first, changed it for reasons unknown, I was probably doing something else and eventually realized it was unnecessary and then just made quick edit. — , Nov 08 '12 at 22:23

score 2 · Accepted Answer · answered Nov 08 '12 at 22:17

2

sliding isn't efficient with strings. It breaks the string down into characters, boxes them, then reassembles them into a string.

The fastest way to do this which is still not incredibly hard is with the regionMatches method on String. (Faster yet with DNA is to convert everything into bytes, faster yet is to convert it into 2-bit nibbles and pack into int arrays.)

answered Nov 08 '12 at 22:17

Rex Kerr

166,841
26
322
407

Would applying this algorithm provide a quicker implementation too? http://en.wikipedia.org/wiki/Knuth-Morris-Pratt_algorithm – Nov 08 '12 at 23:09
@Leolinus - Probably, but KMP works best with a large number of distinct characters. The extra logic is often not worth it when your jumps are short. Other strategies are likely to give bigger gains for less effort. – Rex Kerr Nov 08 '12 at 23:15

Finding a Motif in DNA

1 Answers1