0

I need help with one problem in Rosalind here is the link to the problem for more information: http://rosalind.info/problems/lcsm/

The following is the sample input:

>Rosalind_1
GATTACA
>Rosalind_2
TAGACCA
>Rosalind_3
ATACA

Sample output is :

AC

and that's the basic information of the input and output:

Given:  A collection of k (k≤100) DNA strings of length at most
        1 kbp each in FASTA format.

Return: A longest common substring of the collection. (If multiple
        solutions exist, you may return any single solution.)

The problem is asking to find the longest shared string that is present in all three sequences, and the common longest string found in this example is AC.

Now, moving to the issue with programming it.

The idea I had - and tried - is to take the first string GATTACA and check all the combination of symbols and check if it is present in the sequence two and three.

How can I do that to loop over all possible bases for all the three sequences and find the longest motif in python?

kdopen
  • 8,032
  • 7
  • 44
  • 52

1 Answers1

0

This is a classic problem in computer science and in Bioinformatics. It is often referred to as the Longest common substring (LCS) problem. In your case this is extended to LCS for >2 sequences.

A technique called "Dynamic programming" is often used for LCS for two sequences and can be extended for >2. Dynamic programming as breaking down a problem into subproblems, solving (and remebering the answers) to the subproblems and finally tracing back to get the answer to the full problem. Another approach is to use a general suffix tree.

I suggest you first look at how this is solved for 2 strings: wikibooks.

Then see how it can be extended for >2. This previous question also seems to be very relevant and has associated code.

Good luck!

Community
  • 1
  • 1
tomnl
  • 44
  • 4