16

I got curious by Jon Limjap's interview mishap and started to look for efficient ways to do palindrome detection. I checked the palindrome golf answers and it seems to me that in the answers are two algorithms only, reversing the string and checking from tail and head.

def palindrome_short(s):
    length = len(s)
    for i in xrange(0,length/2):
        if s[i] != s[(length-1)-i]: return False
    return True

def palindrome_reverse(s):
    return s == s[::-1]

I think neither of these methods are used in the detection of exact palindromes in huge DNA sequences. I looked around a bit and didn't find any free article about what an ultra efficient way for this might be.

A good way might be parallelizing the first version in a divide-and-conquer approach, assigning a pair of char arrays 1..n and length-1-n..length-1 to each thread or processor.

What would be a better way?

Do you know any?

Community
  • 1
  • 1
Vinko Vrsalovic
  • 330,807
  • 53
  • 334
  • 373

10 Answers10

7

Given only one palindrome, you will have to do it in O(N), yes. You can get more efficiency with multi-processors by splitting the string as you said.

Now say you want to do exact DNA matching. These strings are thousands of characters long, and they are very repetitive. This gives us the opportunity to optimize.

Say you split a 1000-char long string into 5 pairs of 100,100. The code will look like this:

isPal(w[0:100],w[-100:]) and isPal(w[101:200], w[-200:-100]) ...

etc... The first time you do these matches, you will have to process them. However, you can add all results you've done into a hashtable mapping pairs to booleans:

isPal = {("ATTAGC", "CGATTA"): True, ("ATTGCA", "CAGTAA"): False}

etc... this will take way too much memory, though. For pairs of 100,100, the hash map will have 2*4^100 elements. Say that you only store two 32-bit hashes of strings as the key, you will need something like 10^55 megabytes, which is ridiculous.

Maybe if you use smaller strings, the problem can be tractable. Then you'll have a huge hashmap, but at least palindrome for let's say 10x10 pairs will take O(1), so checking if a 1000 string is a palindrome will take 100 lookups instead of 500 compares. It's still O(N), though...

Smart Manoj
  • 5,230
  • 4
  • 34
  • 59
Claudiu
  • 224,032
  • 165
  • 485
  • 680
  • 4
    You are forgetting that hash lookup is linear in the length of the key and since hash calculation uses some arithmetics it's actually less efficient than char-by-char comparison. Also chunking won't help even if you partalelize since for every miss you'll have huge amount of wasted work and there's much more misses than hits. Comparing from the center is much more efficient since you can bail out early. – ZXX Aug 03 '10 at 06:30
3

Another variant of your second function. We need no check equals of the right parts of normal and reverse strings.

def palindrome_reverse(s):
  l = len(s) / 2
  return s[:l] == s[l::-1]
drnk
  • 764
  • 2
  • 9
  • 18
2

There isn't, unless you do a fuzzy match. Which is what they probably do in DNA (I've done EST searching in DNA with smith-waterman, but that is obviously much harder then matching for a palindrome or reverse-complement in a sequence).

nlucaroni
  • 47,556
  • 6
  • 64
  • 86
2

Obviously, you're not going to be able to get better than O(n) asymptotic efficiency, since each character must be examined at least once. You can get better multiplicative constants, though.

For a single thread, you can get a speedup using assembly. You can also do better by examining data in chunks larger than a byte at a time, but this may be tricky due to alignment considerations. You'll do even better to use SIMD, if you can examine chunks as large as 16 bytes at a time.

If you wanted to parallelize it, you could divide the string into N pieces, and have processor i compare the segment [i*n/2, (i+1)*N/2) with the segment [L-(i+1)*N/2, L-i*N/2).

Adam Rosenfield
  • 390,455
  • 97
  • 512
  • 589
  • Instead of comparing chunks of 16 bytes, it's probably faster to do 4 palindromes at a time. It'll save you swizzling data and probably doesn't require as much horizontal operations. – Jasper Bekkers Oct 29 '08 at 20:04
  • Other ideas: Store as much of the key in one machine word as you can. Compare this to each byte of a memory buffer containing the test item. Do not resort to string operations until this hits. Do not use anything wider than 8-bit characters as the limiting factor is going to be memory access. – Loren Pechtel Jul 20 '10 at 00:46
1

They are both in O(N) so I don't think there is any particular efficiency problem with any of these solutions. Maybe I am not creative enough but I can't see how would it be possible to compare N elements in less than N steps, so something like O(log N) is definitely not possible IMHO.

Pararellism might help, but it still wouldn't change the big-Oh rank of the algorithm since it is equivalent to running it on a faster machine.

Tamas Czinege
  • 118,853
  • 40
  • 150
  • 176
0

Comparing from the center is always much more efficient since you can bail out early on a miss but it alwo allows you to do faster max palindrome search, regardless of whether you are looking for the maximal radius or all non-overlapping palindromes.

The only real paralellization is if you have multiple independent strings to process. Splitting into chunks will waste a lot of work for every miss and there's always much more misses than hits.

ZXX
  • 4,684
  • 27
  • 35
  • Can you explain how comparing from the center allows to bail out early? Is there any reason why mismatches near the center would be more probable than mismatches near the edges? – Michał Wróbel Dec 13 '18 at 13:20
  • How do you know that "there's always much more misses than hits"? Has the OP stated anything about the characteristics of the input data? – Michał Wróbel Dec 13 '18 at 13:41
0

On top of what others said, I'd also add a few pre-check criteria for really large inputs :

quick check whether tail-character matches 
                    head character 

if NOT, just early exit by returning Boolean-False

if (input-length < 4) { 

    # The quick check just now already confirmed it's palindrome 

    return Boolean-True  

} else if (200 < input-length) {
     
     # adjust this parameter to your preferences
     #
     # e.g. make it 20 for longer than 8000 etc
     #      or make it scale to input size,
     #      either logarithmically, or a fixed ratio like 2.5%
     #
     reverse last ( N = 4 ) characters/bytes of the input  

     if that **DOES NOT** match first N chars/bytes {

         return boolean-false # early exit
                              # no point to reverse rest of it
                              # when head and tail don't even match
     } else {
         
         if N was substantial

             trim out the head and tail of the input
             you've confirmed; avoid duplicated work

             remember to also update the variable(s)
             you've elected to track the input size  

     }

     [optional 1 : if that substring of N characters you've 
                   just checked happened to all contain the
                   same character, let's call it C,
                    
                   then locate the index position, P, for the first               
                   character that isn't C
                   
                   if P == input-size 

                       then you've already proven
                       the entire string is a nonstop repeat 
                       of one single character, which, by def, 
                       must be a palindrome

                       then just return Boolean-True


                   but the P is more than the half-way point,
                       you've also proven it cannot possibly be a 
                       palindrome, because second half contains a              
                       component that doesn't exist in first half,
                       

                       then just return Boolean-False ]


     [optional 2 : for extremely long inputs, 
                   like over 200,000 chars,
                   take the N chars right at the middle of it,
                   and see if the reversed one matches
                 
                   if that fails, then do early exit and save time ]

 }

 if all pre-checks passed,
 then simply do it BAU style :

     reverse second-half of it, 
     and see if it's same as first half
RARE Kpop Manifesto
  • 2,453
  • 3
  • 11
-1

You can use a hashtable to put the character and have a counter variable whose value increases everytime you find an element not in table/map. If u check and find element thats already in table decrease the count.

For odd lettered string the counter should be back to 1 and for even it should hit 0.I hope this approach is right.

See below the snippet.
s->refers to string
eg: String s="abbcaddc";
Hashtable<Character,Integer> textMap= new Hashtable<Character,Integer>();
        char charA[]= s.toCharArray();
        for(int i=0;i<charA.length;i++)
        {

            if(!textMap.containsKey(charA[i]))
            {   
                textMap.put(charA[i], ++count);

            }
            else
                {
                textMap.put(charA[i],--count);


        }
        if(length%2 !=0)
        {
            if(count == 1)
            System.out.println("(odd case:PALINDROME)");
            else
                System.out.println("(odd case:not palindrome)");
        }
        else if(length%2==0)    
        {
            if(count ==0)
                System.out.println("(even case:palindrome)");
            else
                System.out.println("(even case :not palindrome)");
        }
-1

With Python, short code can be faster since it puts the load into the faster internals of the VM (And there is the whole cache and other such things)

def ispalin(x):
   return all(x[a]==x[-a-1] for a in xrange(len(x)>>1))
Vinko Vrsalovic
  • 330,807
  • 53
  • 334
  • 373
Demur Rumed
  • 351
  • 3
  • 9
-1
public class Palindrome{
    private static boolean isPalindrome(String s){
        if(s == null)
            return false;   //unitialized String ? return false
        if(s.isEmpty())     //Empty Strings is a Palindrome 
            return true;
        //we want check characters on opposite sides of the string 
        //and stop in the middle <divide and conquer>
        int left = 0;  //left-most char    
        int right = s.length() - 1; //right-most char

        while(left < right){  //this elegantly handles odd characters string 
            if(s.charAt(left) != s.charAt(right)) //left char must equal 
                return false; //right else its not a palindrome
            left++; //converge left to right 
            right--;//converge right to left 
        }
        return true; // return true if the while doesn't exit 
    }
}

though we are doing n/2 calculations its still O(n) this can done also using threads, but calculations get messy, best to avoid it. this doesn't test for special characters and is case sensitive. I have code that does it, but this code can be modified to handle that easily.

  • Try to explain more clearly why this is an answer to the question. – Jeroen Heier Nov 29 '18 at 07:08
  • This site is for answering questions people have. Code can help answer some questions, but what's more important is describing the ideas behind the code. – Sneftel Nov 29 '18 at 07:19