8

I am looking for a fast algorithm for search purpose in a huge string (it's a organism genome sequence composed of hundreds of millions to billions of chars).

There are only 4 chars {A,C,G,T} present in this string, and "A" can only pair with "T" while "C" pairs with "G".

Now I am searching for two substrings (with length constraint of both substring between {minLen, maxLen}, and interval length between {intervalMinLen, intervalMaxLen}) that can pair with one another antiparallely.

For example, The string is: ATCAG GACCA TACGC CTGAT

Constraints: minLen = 4, maxLen = 5, intervalMinLen = 9, intervalMaxLen = 10

The result should be

  1. "ATCAG" pair with "CTGAT"

  2. "TCAG" pair with "CTGA"

Thanks in advance.

Update: I already have the method to determine whether two string can pair with one another. The only concern is doing exhaustive search is very time consuming.

Nate
  • 1,889
  • 1
  • 20
  • 40
Mavershang
  • 1,266
  • 2
  • 15
  • 27
  • 4
    It sounds like a job for Regex (System.Text.RegularExpressions.Regex). This isn't homework by chance, is it? – Mark Avenius Jan 10 '12 at 22:20
  • What is the typical values for minLen, maxLen, intervalMinLen and intervalMaxLen ? – ElKamina Jan 10 '12 at 22:23
  • Those length constraints could be any value theoretically. But generally minLen and maxLen should between {0, 30}, intervalMinLen and intervalMaxLen could be hundreds to several thousands – Mavershang Jan 10 '12 at 22:25
  • What do intervalMinLen and intervalMaxLen control? – Russell Borogove Jan 10 '12 at 22:25
  • The interval length between the two pairing substrings. – Mavershang Jan 10 '12 at 22:27
  • Will you be doing MANY searches on ONE dataset? The background is, that in this case a costly transformation of the dataset, that speeds up the search is acceptable. Just to clear up the wording: How many hits do you expect per complete search cycle. – Eugen Rieck Jan 10 '12 at 22:31
  • @Mark: "That ringing in your ears is [Jamie Zawinski](http://codemines.blogspot.com/2006/08/now-they-have-two-problems.html) saying "Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems." – Tim Schmelter Jan 10 '12 at 23:02
  • BLAST doesn't make sense for this purpose- he's a) not looking for a specific gene and gene complement, but rather _any_ pair of inverted repeats within a certain range of each other, no matter their sequence, and b) he's looking for exact inverted repeats, not approximate, so BLAST would be inefficient (not to mention ineffective for very short sequences). ETA: Forgive the "he" above, as I don't know whether Mavershang is male or female. – David Robinson Jan 10 '12 at 23:19
  • Incidentally, [this paper](www.ncbi.nlm.nih.gov/pmc/articles/PMC524409/) is solving a similar problem, but it is ancient (2004), the software doesn't appear to be particularly well maintained, and basically no details of the algorithm are provided – David Robinson Jan 10 '12 at 23:22
  • 1
    you could also ask http://biostar.stackexchange.com – Pierre Jan 11 '12 at 07:30

5 Answers5

4

I know you aren't searching for substrings, but I think it might be worthwhile to create a reversed genome string containing the matches; the task would then be to find common substrings in the two strings.

Example:

Original string

  ATCAG GACCA TACGC CTGAT

Reversed string:

  TAGTC CGCAT ACCAG GACTA

If you then transform the string to it's pairing values (replace T<->A and C<->G, you get something useful:

  ATCAG GCGTA TGGTC CTGAT

I know that this preprocessing will be costly and consume a lot of space, but you will be able to use standard string algorithms afterwards and with the amount of comparisons you are searching, it certainly will be justfied.

When the original string and the reversed lookup string I think your problem sounds surprisingly alike to the 'longest common substring' problem which is well described. Your second preprocessing would be to construct a suffix tree to allow fast lookup of substrings.

you will end up with quadratic running times, but I doubt you can do better

Community
  • 1
  • 1
faester
  • 14,886
  • 5
  • 45
  • 56
3

Easiest solution would be to construct a Trie from the data of maximum height maxLen. Each node should also have a list of locations where that particular string was found (it will be in ascending order, if the trie is created in order).

Then, while searching, just reverse compliment the search string and go through the trie. When you find a match check if one of the matches is located in proper interval, if 'yes' then output the strings.

Let me know if you need the detailed algorithm.

ElKamina
  • 7,747
  • 28
  • 43
  • `compliment the string and go through the trie`. From the examples given, it seems that the two strings need not match in order, but rather need only have the same number of each complemented character (`AC` can match `GT`). Just going through the trie will not accomplish this, don't you think? – efficiencyIsBliss Jan 11 '12 at 02:33
  • @efficiencyIsBliss Thanks for pointing this out. It should "reverse compliment the string and go through the trie. – ElKamina Jan 11 '12 at 03:45
3

The method is described in http://www.ncbi.nlm.nih.gov/pubmed/11381028

Genome Res. 2001 Jun;11(6):1005-17. Segmental duplications: organization and impact within the current human genome project assembly.

Pierre
  • 34,472
  • 31
  • 113
  • 192
2

I thought this was an interesting problem, so I put together a program based on considering 'foldings', which scans outward for possible symmetrical matches from different 'fold points'. If N is the number of nucleotides and M is 'maxInterval-minInterval', you should have running time O(N*M). I may have missed some boundary cases, so use the code with care, but it does work for the example provided. Note that I've used a padded intermediate buffer to store the genome, as this reduces the number of comparisons for boundary cases required in the inner loops; this trades off additional memory allocation for better speed. Feel free to edit the post if you make any corrections or improvements.

class Program
{
    public sealed class Pairing
    {
        public int Index { get; private set; }

        public int Length { get; private set; }

        public int Offset { get; private set; }

        public Pairing(int index, int length, int offset)
        {
            Index = index;
            Length = length;
            Offset = offset;
        }
    }

    public static IEnumerable<Pairing> FindPairings(string genome, int minLen, int maxLen, int intervalMinLen, int intervalMaxLen)
    {
        int n = genome.Length;
        var padding = new string((char)0, maxLen);
        var padded = string.Concat(padding, genome, padding);

        int start = (intervalMinLen + minLen)/2 + maxLen;
        int end = n - (intervalMinLen + minLen)/2 + maxLen;

        //Consider 'fold locations' along the genome
        for (int i=start; i<end; i++)
        {
            //Consider 'odd' folding (centered on index) about index i
            int k = (intervalMinLen+2)/2;
            int maxK = (intervalMaxLen + 2)/2;
            while (k<=maxK)
            {
                int matchLength = 0;
                while (IsPaired(padded[i - k], padded[i + k]) && (k <= (maxK+maxLen)))
                {
                    matchLength++;

                    if (matchLength >= minLen && matchLength <= maxLen)
                    {
                        yield return new Pairing(i-k - maxLen, matchLength, 2*k - (matchLength-1));
                    }
                    k++;
                }
                k++;
            }

            //Consider 'even' folding (centered before index) about index i
            k = (intervalMinLen+1)/2;
            while (k <= maxK)
            {
                int matchLength = 0;
                while (IsPaired(padded[i - (k+1)], padded[i + k]) && (k<=maxK+maxLen))
                {
                    matchLength++;

                    if (matchLength >= minLen && matchLength <= maxLen)
                    {
                        yield return new Pairing(i - (k+1) - maxLen, matchLength, 2*k + 1 - (matchLength-1));
                    }
                    k++;
                }
                k++;
            }
        }
    }

    private const int SumAT = 'A' + 'T';
    private const int SumGC = 'G' + 'C';
    private static bool IsPaired(char a, char b)
    {
        return (a + b) == SumAT || (a + b) == SumGC;
    }


    static void Main(string[] args)
    {
        string genome = "ATCAGGACCATACGCCTGAT";
        foreach (var pairing in FindPairings(genome, 4, 5, 9, 10))
        {
            Console.WriteLine("'{0}' pair with '{1}'",
                              genome.Substring(pairing.Index, pairing.Length),
                              genome.Substring(pairing.Index + pairing.Offset, pairing.Length));
        }
        Console.ReadKey();
    }


}
Dan Bryant
  • 27,329
  • 4
  • 56
  • 102
1

I'd consider converting the strings to binary in 16 bit lengths:

A = 101
T = 010
C = 110
G = 001

This allows up to 5 characters per 16 bit unit. This should be very fast compared to string searches and comparisons.

Use the top bit to determine if it's a 4 sequence unit or 5 sequence unit.

Now you can do a quick sort and get all the 4 sequence and 5 sequence units into their own groups (unless you need to find 4 sequence units that partial match against 5 sequence units).

For the comparison you can sort again, and you'll find that all the sequences that start with G will come before sequences that start with T, then A then C. You can then take a sequence and compare it to just the parts of the sorted array that are possible - which should be a much, much shorter problem space.

Further, the reason I've chosen those three bit encodings for the sequences is that you can simply xor the two strings together to see if they match. If you get 15 1's in sequence, then the two match perfectly. If not, then they don't.

You'll have to figure out the anti-parallel bit - I have a few thoughts on that, but this should get you started, and if the anit-parallel part becomes an issue, ask another question.

Adam Davis
  • 91,931
  • 60
  • 264
  • 330
  • Some of this may be useful or not, as I'm sure I've missed some aspects of the question and gene sequencing in general, but the encoding and comparison should at least make comparison extraordinarily fast on today's processors. – Adam Davis Jan 10 '12 at 22:34
  • 1
    Why not A=00; T=01; C=10; G=11 ? – phoog Jan 10 '12 at 22:40
  • 1
    Why not just 2 bits (A:00, T:11, G:01, C:10) ? – ElKamina Jan 10 '12 at 22:44
  • The problem with two bits is that you can't then xor two sequences together and get a set result. Note that my encoding is symmetrical. This allows a simple logic operation to deduce if two sequences match A-T and G-C. – Adam Davis Jan 10 '12 at 23:20
  • So does mine. A.xor.T = 11, G.xor.C = 11 – ElKamina Jan 10 '12 at 23:43
  • 2
    The reason to use three bits is primarily because one of the issues the OP has it that one set of sequences has to be reversed, antiparallel. 3 bits gives a clear indication of what character is represented whether it's forwards or backwards, which can be useful depending on the sort and comparison operations needed. With only two bits, you can't tell if it's a C or reversed G. – Adam Davis Jan 11 '12 at 02:41