how to match dna sequence pattern

Question

I am getting a trouble finding an approach to solve this problem.

Input-output sequences are as follows :

 **input1 :** aaagctgctagag 
 **output1 :** a3gct2ag2

 **input2 :** aaaaaaagctaagctaag 
 **output2 :** a6agcta2ag

Input nsequence can be of 10^6 characters and largest continuous patterns will be considered.

For example for input2 "agctaagcta" output will not be "agcta2gcta" but it will be "agcta2".

Any help appreciated.

What output must be provided for input `aabbaabb`? Two possible variants: `a2b2a2b2` and `aabb2`. — Egor Skriptunoff, Jun 01 '13 at 09:51
And what about `aaaaaaaaabbbbbbbbbaaaaaaaaabbbbbbbbb`: `a9b9a9b9` or `aaaaaaaaabbbbbbbbb2`? The former is shorter ;-) — Egor Skriptunoff, Jun 01 '13 at 11:22
number of characters and their counts should be minimum..for example a9b9a9b9 takes 8 alphanumeric count but aaaaaaaaabbbbbbbbb2 takes 19 alphanumeric count — user2442890, Jun 01 '13 at 11:50
So, you need to write kinda archiver working in "best compression" mode? — Egor Skriptunoff, Jun 01 '13 at 12:07
is there any limit in each input sequence on how large the repeated pattern can be, e.g., could you have a repeated sub-sequence of length 10^3? — גלעד ברקן, Jun 01 '13 at 19:04
basically we are working on DNA sequences....there is no limit on the sub sequence length..it could be of any length..but the sub sequence length won't go to 10^3 length...the maximum length is up to 100 characters as per me...since i referred different DNA sequences — user2442890, Jun 01 '13 at 19:40
Are you trying to create the output? Also what happens if there is no repeat in the middle e.g. `aabbaabbcxyzxyz` might be `aabb2cxyz2` but the reverse-engineered input could be `aabbaabbcxyzcxyz`? — glh, Jun 02 '13 at 00:35
@user2442890: If it was at least a textbook exercise with greedy approach and bad algos sufficient, but you seem to have a real world problem here... And if you require optimum, non-greedy solution, that whispers to me: 'dynamic programming' and 'Ukonnen's algorithm'. A tall question you are asking here. — Boris Stitnicky, Jun 02 '13 at 03:57
How fast do you need it to process one input of length 10^6? — גלעד ברקן, Jun 02 '13 at 14:39
@groovy: Trust me, exponential complexity won't suffice. This is, I guess, about compressing whole chromosomes. — Boris Stitnicky, Jun 02 '13 at 17:38

jbaylina · Answer 1 · 2013-06-12T15:07:29.590

Explanation of the algorithm:

Having a sequence S with symbols s(1), s(2),…, s(N).
Let B(i) be the best compressed sequence with elements s(1), s(2),…,s(i).
So, for example, B(3) will be the best compressed sequence for s(1), s(2), s(3).
What we want to know is B(N).

To find it, we will proceed by induction. We want to calculate B(i+1), knowing B(i), B(i-1), B(i-2), …, B(1), B(0), where B(0) is empty sequence, and and B(1) = s(1). At the same time, this constitutes a proof that the solution is optimal. ;)

To calculate B(i+1), we will pick the best sequence among the candidates:

Candidate sequences where the last block has one element:

B(i )s(i+1)1 B(i-1)s(i+1)2 ; only if s(i) = s(i+1) B(i-2)s(i+1)3 ; only if s(i-1) = s(i) and s(i) = s(i+1) … B(1)s(i+1)[i-1] ; only if s(2)=s(3) and s(3)=s(4) and … and s(i) = s(i+1) B(0)s(i+1)i = s(i+1)i ; only if s(1)=s(2) and s(2)=s(3) and … and s(i) = s(i+1)
Candidate sequences where the last block has 2 elements:

B(i-1)s(i)s(i+1)1 B(i-3)s(i)s(i+1)2 ; only if s(i-2)s(i-1)=s(i)s(i+1) B(i-5)s(i)s(i+1)3 ; only if s(i-4)s(i-3)=s(i-2)s(i-1) and s(i-2)s(i-1)=s(i)s(i+1) …
Candidate sequences where the last block has 3 elements:

…
Candidate sequences where the last block has 4 elements:

…

…
Candidate sequences where last block has n+1 elements:

s(1)s(2)s(3)………s(i+1)

For each possibility, the algorithm stops when the sequence block is no longer repeated. And that’s it.

The algorithm will be some thing like this in psude-c code:

B(0) = “”
for (i=1; i<=N; i++) {
    // Calculate all the candidates for B(i)
    BestCandidate=null
    for (j=1; j<=i; j++) {
        Calculate all the candidates of length (i)

        r=1;
        do {
             Candidadte = B([i-j]*r-1) s(i-j+1)…s(i-1)s(i) r
             If (   (BestCandidate==null) 
                      || (Candidate is shorter that BestCandidate)) 
                 {
            BestCandidate=Candidate.
                 }
             r++;
        } while (  ([i-j]*r <= i) 
             &&(s(i-j*r+1) s(i-j*r+2)…s(i-j*r+j) == s(i-j+1) s(i-j+2)…s(i-j+j))

    }
    B(i)=BestCandidate
}

Hope that this can help a little more.

The full C program performing the required task is given below. It runs in O(n^2). The central part is only 30 lines of code.

EDIT I have restructured a little bit the code, changed the names of the variables and added some comment in order to be more readable.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <limits.h>


// This struct represents a compressed segment like atg4, g3,  agc1
    struct Segment {
        char *elements;
        int nElements;
        int count;
    };
         // As an example, for the segment agagt3  elements would be:
         // {
         //      elements: "agagt",
         //      nElements: 5,
         //      count: 3
         // }

    struct Sequence {
        struct Segment lastSegment;
        struct Sequence *prev;      // Points to a sequence without the last segment or NULL if it is the first segment
        int totalLen;  // Total length of the compressed sequence.
    };
       // as an example, for the sequence agt32ta5, the representation will be:
       // {
       //     lastSegment:{"ta" , 2 , 5},
       //     prev: @A,
       //     totalLen: 8
       // }
       // and A will be
       // {
       //     lastSegment{ "agt", 3, 32},
       //     prev: NULL,
       //     totalLen: 5
       // }


// This function converts a sequence to a string.
// You have to free the string after using it.
// The strategy is to construct the string from right to left.

char *sequence2string(struct Sequence *S) {
    char *Res=malloc(S->totalLen + 1);
    char *digits="0123456789";

    int p= S->totalLen;
    Res[p]=0;

    while (S!=NULL) {
            // first we insert the count of the last element.
            // We do digit by digit starting with the units.
            int C = S->lastSegment.count;
            while (C) {
                p--;
                Res[p] = digits[ C % 10 ];
                C /= 10;
            }

            p -= S->lastSegment.nElements;
            strncpy(Res + p , S->lastSegment.elements, S->lastSegment.nElements);

            S = S ->prev;
    }


    return Res;
}


// Compresses a dna sequence.
// Returns a string with the in sequence compressed.
// The returned string must be freed after using it.
char *dnaCompress(char *in) {
    int i,j;

    int N = strlen(in);;            // Number of elements of a in sequence.



    // B is an array of N+1 sequences where B(i) is the best compressed sequence sequence of the first i characters.
    // What we want to return is B[N];
    struct Sequence *B;
    B = malloc((N+1) * sizeof (struct Sequence));

    // We first do an initialization for i=0

    B[0].lastSegment.elements="";
    B[0].lastSegment.nElements=0;
    B[0].lastSegment.count=0;
    B[0].prev = NULL;
    B[0].totalLen=0;

    // and set totalLen of all the sequences to a very HIGH VALUE in this case N*2 will be enougth,  We will try different sequences and keep the minimum one.
    for (i=1; i<=N; i++) B[i].totalLen = INT_MAX;   // A very high value

    for (i=1; i<=N; i++) {

        // at this point we want to calculate B[i] and we know B[i-1], B[i-2], .... ,B[0]
        for (j=1; j<=i; j++) {

            // Here we will check all the candidates where the last segment has j elements

            int r=1;                  // number of times the last segment is repeated
            int rNDigits=1;           // Number of digits of r
            int rNDigitsBound=10;     // We will increment r, so this value is when r will have an extra digit.
                                      // when r = 0,1,...,9 => rNDigitsBound = 10
                                      // when r = 10,11,...,99 => rNDigitsBound = 100
                                      // when r = 100,101,.,999 => rNDigitsBound = 1000 and so on.

            do {

                // Here we analitze a candidate B(i).
                // where the las segment has j elements repeated r times.

                int CandidateLen = B[i-j*r].totalLen + j + rNDigits;
                if (CandidateLen < B[i].totalLen) {
                    B[i].lastSegment.elements = in + i - j*r;
                    B[i].lastSegment.nElements = j;
                    B[i].lastSegment.count = r;
                    B[i].prev = &(B[i-j*r]);
                    B[i].totalLen = CandidateLen;
                }

                r++;
                if (r == rNDigitsBound ) {
                    rNDigits++;
                    rNDigitsBound *= 10;
                }
            } while (   (i - j*r >= 0)
                     && (strncmp(in + i -j, in + i - j*r, j)==0));

        }
    }

    char *Res=sequence2string(&(B[N]));
    free(B);

    return Res;
}

int main(int argc, char** argv) {
    char *compressedDNA=dnaCompress(argv[1]);
    puts(compressedDNA);
    free(compressedDNA);
    return 0;
}

+1, seems to work good enough, also C is a good language to implement this, although a bit hard to read. Eg. what is that `ss` variable. Is this solution optimal? It's a bit too easy on the OP to just hand him a solution like this. He should give you credit in his paper :-))) — Boris Stitnicky, Jun 03 '13 at 19:31
Seeing the difference between your current and deserved reputation, gotta give you a bounty for this. — Boris Stitnicky, Jun 03 '13 at 19:38
I promise to try to explain a litle more the algorithm. Any way, s is the number of digits of r and ss is the next number that will increese the number of ciphers of r by one. — jbaylina, Jun 03 '13 at 22:07
The algorithm should give the optimum. And if you want to limit the maximum sequence length that can be repeated as you said in one of your comments, you can just limit the second for loop to run a maximum of N (Where N es the maximum sequence). so the line woud be: for (j=i-i, (j>0)&&(i-j < N), j++) { . This should improve the performance to just O(n*N) — jbaylina, Jun 04 '13 at 00:01
@jbaylina: So it is dynamic programming after all. Up until now, I wrote very little C in my life and could not understand your code. This kind of questions and answer are what makes SO interesting. After you explained the magic trick, it sounds almost like naïve algorithm :-). But, I wonder, can this be done in shorther O? Such as n log n or linear, using suffix trees or something? — Boris Stitnicky, Jun 05 '13 at 00:20
@BorisStitnicky Yes, the C-code is not very clear. I have pending to comment it and clean it a little. — jbaylina, Jun 06 '13 at 16:26
@BorisStitnicky About the question if it can be improved, here is what I would do to improve it: In the second loop, I believe that it is not always necessary to go down to 0. A bound can be found that ensure that it is not necessary to continue the loop. I believe that this improvement can work in sequences that can be compressed by more than 2. (I have no idea if dna satisfies this condition…). Any way, this working line would need a little of study but I believe that O(n^2) can be improved in a high compressable sequence. — jbaylina, Jun 06 '13 at 16:26
@jbaylina can you please elaborate how the algorithm is working..it is difficult to understand..and explain the usage of variables in the code — user2442890, Jun 06 '13 at 17:37
@user2442890: Stack Overflow is amazing, isn't it? Imho, his algorithm explanation is already pretty good, maybe some formatting... — Boris Stitnicky, Jun 06 '13 at 22:22
@user2442890 I have changed the C code in order to be more readable. I have tested a little, but plese test it yourserf to verify that it works. And fill free to improve the format. — jbaylina, Jun 07 '13 at 09:51
@BorisStitnicky @jbaylina @user2442890 ...just as a side note -- I continued to work with my limited skills -- please see this example of two different compressions of real data: http://pastebin.com/JFirziJb --What I find interesting is that the shorter compression is actually much closer to the original string, leaving many single note repetitions the same, rather than opting for probably many `1`'s that would be added from too much compression. jbaylina, could you paste somewhere the result of your algorithm on the example `str`? — גלעד ברקן, Jun 10 '13 at 01:52
@groovy: Don't ask our start to do menial work for you, just compile it with `gcc` and get the compression by yourself. If you want to continue being interested in this problem, you could try to transit from O(n^2) to O(n.log n), or linear time via postfix trees or something. But afaiac, O(n^2) is good enough. — Boris Stitnicky, Jun 10 '13 at 02:17
I have been thinking a little bit about optimizing the solution, and I am quite sure that it can be improved. As @BorisStitnicky pointed out, suffix tree can be used to find the repeated sequences with a cost of O(n*log(n)). With that information you can improve the algorithm that I proposed. — jbaylina, Jun 10 '13 at 12:44
@groovy Copy the C code to a file called dna.c and from the command line of a *nix machine just compili it: "gcc dna.c -o dna" Change the permisions of the executable: "chmod 755 dna" and execute with the first parameter the string you want to compress: "./dna dadadagg" — jbaylina, Jun 10 '13 at 12:50
@jbaylina the algorithm working good...i got the working of the algorithm..can u explain the working of this algorithm by taking some sample sequences — user2442890, Jun 15 '13 at 15:34

score 2 · Answer 2 · answered Jun 02 '13 at 05:26

Forget Ukonnen. Dynamic programming it is. With 3-dimensional table:

sequence position
subsequence size
number of segments

TERMINOLOGY: For example, having a = "aaagctgctagag", sequence position coordinate would run from 1 to 13. At sequence position 3 (letter 'g'), having subsequence size 4, the subsequence would be "gctg". Understood? And as for the number of segments, then expressing a as "aaagctgctagag1" consists of 1 segment (the sequence itself). Expressing it as "a3gct2ag2" consists of 3 segments. "aaagctgct1ag2" consists of 2 segments. "a2a1ctg2ag2" would consist of 4 segments. Understood? Now, with this, you start filling a 3-dimensional array 13 x 13 x 13, so your time and memory complexity seems to be around n ** 3 for this. Are you sure you can handle it for million-bp sequences? I think that greedy approach would be better, because large DNA sequences are unlikely to repeat exactly. And, I would suggest that you widen your assignment to approximate matches, and you can publish it straight in a journal.

Anyway, you will start filling the table of compressing a subsequence starting at some position (dimension 1) with length equal to dimension 2 coordinate, having at most dimension 3 segments. So you first fill the first row, representing compressions of subsequences of length 1 consisting of at most 1 segment:

a        a        a        g        c        t        g        c        t        a        g        a        g
1(a1)    1(a1)    1(a1)    1(g1)    1(c1)    1(t1)    1(g1)    1(c1)    1(t1)    1(a1)    1(g1)    1(a1)    1(g1)

The number is the character cost (always 1 for these trivial 1-char sequences; number 1 does not count into the character cost), and in the parenthesis, you have the compression (also trivial for this simple case). The second row will be still simple:

2(a2)    2(a2)    2(ag1)   2(gc1)   2(ct1)   2(tg1)   2(gc1)   2(ct1)   2(ta1)   2(ag1)   2(ga1)    2(ag1)

There is only 1 way to decompose a 2-character sequence into 2 subsequences -- 1 character + 1 character. If they are identical, the result is like a + a = a2. If they are different, such as a + g, then, because only 1-segment sequences are admissible, the result cannot be a1g1, but must be ag1. The third row will be finally more interesting:

2(a3)    2(aag1)  3(agc1)  3(gct1)  3(ctg1)  3(tgc1)  3(gct1)  3(cta1)  3(tag1)  3(aga1)  3(gag1)

Here, you can always choose between 2 ways of composing the compressed string. For example, aag can be composed either as aa + g or a + ag. But again, we cannot have 2 segments, as in aa1g1 or a1ag1, so we must be satisfied with aag1, unless both components consist of the same character, as in aa + a => a3, with character cost 2. We can continue onto 4 th line:

4(aaag1) 4(aagc1) 4(agct1) 4(gctg1) 4(ctgc1) 4(tgct1) 4(gcta1) 4(ctag1) 4(taga1) 3(ag2)

Here, on the first position, we cannot use a3g1, because only 1 segment is allowed at this layer. But at the last position, compression to character cost 3 is agchieved by ag1 + ag1 = ag2. This way, one can fill the whole first-level table all the way up to the single subsequence of 13 characters, and each subsequence will have its optimal character cost and its compression under the first-level constraint of at most 1 segment associated with it.

Then you go to the 2nd level, where 2 segments are allowed... And again, from the bottom up, you identify the optimum cost and compression of each table coordinate under the given level's segment count constraint, by comparing all the possible ways to compose the subsequence using already computed positions, until you fill the table completely and thus compute the global optimum. There are some details to solve, but sorry, I'm not gonna code this for you.

This would be about O(n^4) as calculation of every element contains an inner loop. — Egor Skriptunoff, Jun 02 '13 at 21:53
@EgorSkriptunoff: Which might perhaps be avoided. Perhaps you can even make it in n^2, and if you are not squeamish, you just check the subsequences up to certain length (let's say 4000 bp), and you make saving in the exponent. But I've already spent entirely too much time thinking about this question. That's why I mentioned that there are some details to solve. This problem has actually suspiciously linear structure, and I suspect that much better solution that this O(n^3) (or maybe n^4, as you say) exist. The question is just to find one that would be good enough. Exponential time won't do. — Boris Stitnicky, Jun 02 '13 at 21:59
@EgorSkriptunoff: But yes, I must give it to you, it seems to be `n ** 4`, as multiple compositions will have to be compared, or perhaps even higher exponent, up to 5, I'm really really lazy to count. — Boris Stitnicky, Jun 02 '13 at 22:04

גלעד ברקן · Answer 3 · 2013-06-24T01:56:35.093

After trying my own way for a while, my kudos to jbaylina for his beautiful algorithm and C implementation. Here's my attempted version of jbaylina's algorithm in Haskell, and below it further development of my attempt at a linear-time algorithm that attempts to compress segments that include repeated patterns in a one-by-one fashion:

import Data.Map (fromList, insert, size, (!))

compress s = (foldl f (fromList [(0,([],0)),(1,([s!!0],1))]) [1..n - 1]) ! n  
 where
  n = length s
  f b i = insert (size b) bestCandidate b where
    add (sequence, sLength) (sequence', sLength') = 
      (sequence ++ sequence', sLength + sLength')
    j' = [1..min 100 i]
    bestCandidate = foldr combCandidates (b!i `add` ([s!!i,'1'],2)) j'
    combCandidates j candidate' = 
      let nextCandidate' = comb 2 (b!(i - j + 1) 
                       `add` ((take j . drop (i - j + 1) $ s) ++ "1", j + 1))
      in if snd nextCandidate' <= snd candidate' 
            then nextCandidate' 
            else candidate' where
        comb r candidate
          | r > uBound                         = candidate
          | not (strcmp r True)                = candidate
          | snd nextCandidate <= snd candidate = comb (r + 1) nextCandidate
          | otherwise                          = comb (r + 1) candidate
         where 
           uBound = div (i + 1) j
           prev = b!(i - r * j + 1)
           nextCandidate = prev `add` 
             ((take j . drop (i - j + 1) $ s) ++ show r, j + length (show r))
           strcmp 1   _    = True
           strcmp num bool 
             | (take j . drop (i - num * j + 1) $ s) 
                == (take j . drop (i - (num - 1) * j + 1) $ s) = 
                  strcmp (num - 1) True
             | otherwise = False

Output:

*Main> compress "aaagctgctagag"
("a3gct2ag2",9)

*Main> compress "aaabbbaaabbbaaabbbaaabbb"
("aaabbb4",7)

Linear-time attempt:

import Data.List (sortBy)

group' xxs sAccum (chr, count)
  | null xxs = if null chr 
                  then singles
                  else if count <= 2 
                          then reverse sAccum ++ multiples ++ "1"
                  else singles ++ if null chr then [] else chr ++ show count
  | [x] == chr = group' xs sAccum (chr,count + 1)
  | otherwise = if null chr 
                   then group' xs (sAccum) ([x],1) 
                   else if count <= 2 
                           then group' xs (multiples ++ sAccum) ([x],1)
                   else singles 
                        ++ chr ++ show count ++ group' xs [] ([x],1)
 where x:xs = xxs
       singles = reverse sAccum ++ (if null sAccum then [] else "1")
       multiples = concat (replicate count chr)

sequences ws strIndex maxSeqLen = repeated' where
  half = if null . drop (2 * maxSeqLen - 1) $ ws 
            then div (length ws) 2 else maxSeqLen
  repeated' = let (sequence,(sequenceStart, sequenceEnd'),notSinglesFlag) = repeated
              in (sequence,(sequenceStart, sequenceEnd'))
  repeated = foldr divide ([],(strIndex,strIndex),False) [1..half]
  equalChunksOf t a = takeWhile(==t) . map (take a) . iterate (drop a)
  divide chunkSize b@(sequence,(sequenceStart, sequenceEnd'),notSinglesFlag) = 
    let t = take (2*chunkSize) ws
        t' = take chunkSize t
    in if t' == drop chunkSize t
          then let ts = equalChunksOf t' chunkSize ws
                   lenTs = length ts
                   sequenceEnd = strIndex + lenTs * chunkSize
                   newEnd = if sequenceEnd > sequenceEnd' 
                            then sequenceEnd else sequenceEnd'
               in if chunkSize > 1 
                     then if length (group' (concat (replicate lenTs t')) [] ([],0)) > length (t' ++ show lenTs)
                             then (((strIndex,sequenceEnd,chunkSize,lenTs),t'):sequence, (sequenceStart,newEnd),True)
                             else b
                     else if notSinglesFlag
                             then b
                             else (((strIndex,sequenceEnd,chunkSize,lenTs),t'):sequence, (sequenceStart,newEnd),False)
          else b

addOne a b
  | null (fst b) = a
  | null (fst a) = b
  | otherwise = 
      let (((start,end,patLen,lenS),sequence):rest,(sStart,sEnd)) = a 
          (((start',end',patLen',lenS'),sequence'):rest',(sStart',sEnd')) = b
      in if sStart' < sEnd && sEnd < sEnd'
            then let c = ((start,end,patLen,lenS),sequence):rest
                     d = ((start',end',patLen',lenS'),sequence'):rest'
                 in (c ++ d, (sStart, sEnd'))
            else a

segment xs baseIndex maxSeqLen = segment' xs baseIndex baseIndex where
  segment' zzs@(z:zs) strIndex farthest
    | null zs                              = initial
    | strIndex >= farthest && strIndex > 0 = ([],(0,0))
    | otherwise                            = addOne initial next
   where
     next@(s',(start',end')) = segment' zs (strIndex + 1) farthest'
     farthest' | null s = farthest
               | otherwise = if start /= end && end > farthest then end else farthest
     initial@(s,(start,end)) = sequences zzs strIndex maxSeqLen

areExclusive ((a,b,_,_),_) ((a',b',_,_),_) = (a' >= b) || (b' <= a)

combs []     r = [r]
combs (x:xs) r
  | null r    = combs xs (x:r) ++ if null xs then [] else combs xs r
  | otherwise = if areExclusive (head r) x
                   then combs xs (x:r) ++ combs xs r
                        else if l' > lowerBound
                                then combs xs (x: reduced : drop 1 r) ++ combs xs r
                                else combs xs r
 where lowerBound = l + 2 * patLen
       ((l,u,patLen,lenS),s) = head r
       ((l',u',patLen',lenS'),s') = x
       reduce = takeWhile (>=l') . iterate (\x -> x - patLen) $ u
       lenReduced = length reduce
       reduced = ((l,u - lenReduced * patLen,patLen,lenS - lenReduced),s)

buildString origStr sequences = buildString' origStr sequences 0 (0,"",0)
   where
    buildString' origStr sequences index accum@(lenC,cStr,lenOrig)
      | null sequences = accum
      | l /= index     = 
          buildString' (drop l' origStr) sequences l (lenC + l' + 1, cStr ++ take l' origStr ++ "1", lenOrig + l')
      | otherwise      = 
          buildString' (drop u' origStr) rest u (lenC + length s', cStr ++ s', lenOrig + u')
     where
       l' = l - index
       u' = u - l  
       s' = s ++ show lenS       
       (((l,u,patLen,lenS),s):rest) = sequences

compress []         _         accum = reverse accum ++ (if null accum then [] else "1")
compress zzs@(z:zs) maxSeqLen accum
  | null (fst segment')                      = compress zs maxSeqLen (z:accum)
  | (start,end) == (0,2) && not (null accum) = compress zs maxSeqLen (z:accum)
  | otherwise                                =
      reverse accum ++ (if null accum || takeWhile' compressedStr 0 /= 0 then [] else "1")
      ++ compressedStr
      ++ compress (drop lengthOriginal zzs) maxSeqLen []
 where segment'@(s,(start,end)) = segment zzs 0 maxSeqLen
       combinations = combs (fst $ segment') []
       takeWhile' xxs count
         | null xxs                                             = 0
         | x == '1' && null (reads (take 1 xs)::[(Int,String)]) = count 
         | not (null (reads [x]::[(Int,String)]))               = 0
         | otherwise                                            = takeWhile' xs (count + 1) 
        where x:xs = xxs
       f (lenC,cStr,lenOrig) (lenC',cStr',lenOrig') = 
         let g = compare ((fromIntegral lenC + if not (null accum) && takeWhile' cStr 0 == 0 then 1 else 0) / fromIntegral lenOrig) 
                         ((fromIntegral lenC' + if not (null accum) && takeWhile' cStr' 0 == 0 then 1 else 0) / fromIntegral lenOrig')
         in if g == EQ 
               then compare (takeWhile' cStr' 0) (takeWhile' cStr 0)
               else g
       (lenCompressed,compressedStr,lengthOriginal) = 
         head $ sortBy f (map (buildString (take end zzs)) (map reverse combinations))

Output:

*Main> compress "aaaaaaaaabbbbbbbbbaaaaaaaaabbbbbbbbb" 100 []
"a9b9a9b9"

*Main> compress "aaabbbaaabbbaaabbbaaabbb" 100 []
"aaabbb4"

How about "aaabbbaaabbbaaabbbaaabbb", where the correct answer is "aaabbb4"? — Boris Stitnicky, Jun 02 '13 at 07:13
@BorisStitnicky Thank you for noticing that. I changed `divided = foldr divide [] [div lenOne 2, div lenOne 2 - 1..2]` to `divided = foldr divide [] [2..div lenOne 2]` and now it displays `aaabbb4`. But could that change cause other inaccuracies? — גלעד ברקן, Jun 02 '13 at 10:28
@BorisStitnicky ..it was more complicated than I thought..I took another stab at it — גלעד ברקן, Jun 02 '13 at 16:10
You don't have to tell that to me. I was trying to write it in Ruby, greedily, but the computational complexity without dynamic programming is just so crappy. Since the problem is not a textbook assignment, slow solutions don't matter, so I gave up. — Boris Stitnicky, Jun 02 '13 at 17:29
@BorisStitnicky I made a slight modification. It now outputs to file the compressed version for a random 10^6-character string in about one minute. Should it be much faster? — גלעד ברקן, Jun 03 '13 at 18:15
1 minute running time for 10^6 character string would indicate near-linear runing time. I thus reckon your code takes a greedy approach, which is a deviation from the OP's problem statement, but perhaps a welcome and satisfactory deviation. (In other words, you have solved an easier problem, which might be sufficient in reality.) However, let me warn you, that a random string is _not_ a good representative of typical DNA sequences, where short and long distance repetitions abound. The amount of coding effort you have put into this question is commendable, +1 from me. — Boris Stitnicky, Jun 03 '13 at 18:38

how to match dna sequence pattern

3 Answers3

Linked