-1

what is your best string comparison algorithm?

i find O(n)

#include <string>

bool str_cpmr(char* str1, char* str2)
{
    int l1 = strlen(str1), l2 = strlen(str2) ;          

    if(l1 != l2)
        return false;

    for(int i = 0 ; i < l1 ; i++)
        if(str1[i] != str2[i])
            return false ;

    return true ;
}

and i wonder if there is any other / better solution.

also, how to test that accurately?

i propose to compare

  • 100 matches
  • 100 strings differing by one char swap

is there more to test string compare ?

how is it in stl c++ (slt string::compare) ?

thanks!!!!!

kiriloff
  • 25,609
  • 37
  • 148
  • 229
  • 2
    [confused by c++ tag] I could be wrong, but isn't there `std::string::compare`? Are you including `` only for `strlen`? Use `std::string`s! – Aleksei Zabrodskii Oct 10 '12 at 16:35
  • 1
    Comparing lengths seems a waste as your for loop can just check for EOS. – 001 Oct 10 '12 at 16:36
  • why not have 2 `std::string` objects and then do `s1 == s2` instead of reinventing the wheel. – Naveen Oct 10 '12 at 16:37
  • 1
    You don't need `strlen`. It's duplicating work. – Kerrek SB Oct 10 '12 at 16:37
  • 1
    @tenfour: That's still O(n), it's just got a higher constant multiplier. – Benjamin Lindley Oct 10 '12 at 16:40
  • @tenfour: `strlen` is `O(n)`, since it needs to traverse the string only once. This means that the whole function is `O(n)`, since it does `O(1)` calls to `strlen` (more specifically `2`) and one `O(n)` traverse. @madptr: Why would you ever want to use `char*` instead of `std::string` (actually I can think of a few legimate reasons, but I doubt any of them apply here). – Grizzly Oct 10 '12 at 16:40
  • @tenfour: how is this not `O(n)` with `n = std::max(std::strlen(str1), std::strlen(str2))`. Of course, it can be implemented in `O(m)` time with `m = std::min(std::strlen(str1), std::strlen(str2))`. – Dietmar Kühl Oct 10 '12 at 16:42
  • This code is actually `O(n+m)` where `n` is the length of one string and `m` is the length of the other. It's easy to get `O(min(n,m))` – bames53 Oct 10 '12 at 16:49

5 Answers5

4

You function is O(n), but still takes roughly double the time necessary -- strlen walks through the string to find the length, then (assuming they're the same length) you walk through the strings again comparing the characters.

Instead of that, I'd walk through the strings until you reach a mismatch or the end of both strings. If you reach a mismatch, you return false. You return true if and only if you reach the end of both strings (simultaneously) without any mismatches first.

Jerry Coffin
  • 476,176
  • 80
  • 629
  • 1,111
  • +1: Absolutely, but there is one more thing you can do in front of this: compare the *pointers* passed in before embarking on the walk-thru. If they are the same address, then skip everything and just return true. – WhozCraig Oct 10 '12 at 16:45
  • 1
    @WhozCraig, that will save time if there's a real likelihood of comparing a string to itself; but in code I've written and code I've looked at, even with [string interning](http://en.wikipedia.org/wiki/String_interning) the likelihood is so low that a pre-check like that is a net loss. Eg, if you test 1000 strings of 10 characters, 1 of which is compared with itself, you save about 9 operations, and waste about 999, for net waste of 990 operations. – James Waldby - jwpat7 Oct 10 '12 at 17:18
  • @JerryCoffin regarding end of string: how are you testing for end of string ?tks! – kiriloff Oct 10 '12 at 17:44
  • @jwpat7 I totally concur on the probability weight. Thus such a precondition check is so specialized. If you're specific algorithms being coded are *designed* to potentially pass the same pointer twice then its a win, otherwise it is likely not and that cmp-instruction is going to be a net op-loss. Of course, it is still O(1)+O(n) and therefore O(n). If that extra cmp is that relevant chances are you should be writing this in asm in the first place. – WhozCraig Oct 10 '12 at 17:54
  • @madptr: a byte containing `'\0'` signals the end of a string. – Jerry Coffin Oct 10 '12 at 18:32
3

Logically it's hard to see how you can check all the values in a string for a single char mismatch in less than O(n) time - assuming you have no other info about the string.

If this is a real application and you have some knowledge of the strngs and the type of differences you could do better on average by checking every Nth char first if you know that it contains sequences of length 'N' eg part or phone numbers.

edit: Note this is still O(n), O() only describes the power of the scaling, it would just be O(n/N) which is still O(n). If you make the string 10x longer checking every Nth entry still takes 10x as long.

Martin Beckett
  • 94,801
  • 28
  • 188
  • 263
3

what is your best string comparison algorithm?

template< class T, class Alloc > bool operator==( basic_string<T,Alloc>& lhs, basic_string<T,Alloc>& rhs );.

It compares two strings using only two characters of source code:

a==b;

Here's a non-smartalec answer, written in C:

bool str_cpmr(char* str1, char* str2)
{
  while( *str1 && *str2 && *str1++ == *str2++ )
    ;
  return *str1 == *str2;
}

It is exactly one loop, so it is obviously O(n), where n goes as length of the shorter string. Also, it is likely to compile to exactly 2n memory fetches. You can go faster with specialized string instructions (so calling strcmp() will probably go faster than this), but you probably won't go faster in straight C.

Robᵩ
  • 163,533
  • 20
  • 239
  • 308
2

Your improved function might look like this:

bool str_cpmr(char* str1, char* str2)
{
    if (NULL == str1 || NULL == str2) return false;

    while (*str1 && *str2) {
        if (*str1++ != *str2++) {
            return false;
        }
    }

    return *str1 || *str2 ? false : true;
}
001
  • 13,291
  • 5
  • 35
  • 66
  • 1
    You can do better: for (; *str1 == *str2 && *str1; ++str1, ++str2;){} return *str1 == *str2; // Only two comparisons per loop instead of three. – rici Oct 10 '12 at 17:55
2

If there is no additional information on the nature of the strings, there is nothing that can be better than O(n), where n is the length of the (shorter) string.

You cannot do with less than n comparisons! Give me an algorithm that can do it with n-1 comparisons. Then there must be a position in the string where the algorithm cannot tell if the characters are different or not. That way I can give you an example where you algorithm with n-1 comparisons fails.

You can only improve this by a constant factor. This will also take into account additional information, e.g. if you know that the underlying hardware compares 32-bit values faster than 8-bit values, then it will better to compare chunks of four characters instead of comparing character by character. You will not do much better.

Zane
  • 926
  • 8
  • 21