Bit Operation For Finding String Difference

Question

The following string of mine tried to find difference between two strings. But it's horribly slow as it iterate the length of string:

#include <string>
#include <vector>
#include <iostream>
using namespace std;


int hd(string s1, string s2) {
    // hd stands for "Hamming Distance"
    int dif = 0;

    for (unsigned i = 0; i < s1.size(); i++ ) {
        string b1 = s1.substr(i,1);
        string b2 = s2.substr(i,1);

        if (b1 != b2) {
            dif++;
        }
    }  

    return dif;
}

int main() {

    string string1 = "AAAAA";
    string string2 = "ATATT";
    string string3 = "AAAAA";

    int theHD12 = hd(string1,string2);
    cout << theHD12 << endl;

    int theHD13 = hd(string1,string3);
    cout << theHD13 << endl;
}

Is there a fast alternative to do that? In Perl we can have the following approach:

sub hd {
    return ($_[0] ^ $_[1]) =~ tr/\001-\255//;
}

which is much2 faster than iterating the position.

I wonder what's the equivalent of it in C++?

jeez, no wonder it's slow, when you're allocating new strings merely to hold the single `char`s you could get from `operator[]`, at each and every index. — underscore_d, Apr 12 '16 at 11:12

score 12 · Answer 1 · answered Feb 17 '09 at 14:54

12

Try to replace the for loop by:

for (unsigned i = 0; i < s1.size(); i++ ) {
    if (b1[i] != b2[i]) {
            dif++;
    }
}

This should be a lot faster because no new strings are created.

answered Feb 17 '09 at 14:54

schnaader

49,103
10
104
136

lmao, I didn't notice they were allocating 2 x new strings at every index, to hold copies of the `char`s... – underscore_d Apr 12 '16 at 11:13

Éric Malenfant · Accepted Answer · 2009-02-17T15:52:41.087

9

Fun with the STL:

#include <numeric>    //inner_product
#include <functional> //plus, equal_to, not2
#include <string>   
#include <stdexcept>

unsigned int 
hd(const std::string& s1, const std::string& s2)
{
    // TODO: What should we do if s1.size() != s2.size()?
    if (s1.size() != s2.size()){
      throw std::invalid_argument(
          "Strings passed to hd() must have the same lenght"
      );
    }

    return std::inner_product(
        s1.begin(), s1.end(), s2.begin(), 
        0, std::plus<unsigned int>(),
        std::not2(std::equal_to<std::string::value_type>())
    );
}

edited Feb 17 '09 at 15:52

answered Feb 17 '09 at 15:41

Éric Malenfant

13,938
1
40
42

1

7 years later, Samaras has a question: can you explain please? :) I must be very dumb to be the first to ask! :) – gsamaras Nov 23 '16 at 22:05
3

@gsamaras: In its basic version, inner_product calculates the sum of the product of two ranges, A and B: A[0]*B[0]+A[1]*B[1]+... In the generalized version (used here), the two operations (addition and multiplication) are provided by the caller. What we want here is the count of element pairs that are different, so we still want the first operation to be addition (std::plus), but we want the second operation to be "is not equal" (std::not(std::equal_to)) instead of multiplication. – Éric Malenfant Nov 24 '16 at 00:06
I see Eric, thanks, in this [question](http://stackoverflow.com/questions/40773463/how-to-store-binary-data-when-you-only-care-about-speed), a comparison of your function and for-loop and if ! approach is done, using different data structures. – gsamaras Nov 24 '16 at 13:24

score 4 · Answer 3 · edited Sep 20 '12 at 13:06

Use iterators:

int GetHammingDistance(const std::string &a, const std::string &b)
{
    // Hamming distance is not defined for strings of different lengths.
    ASSERT(a.length() == b.length());

    std::string::const_iterator a_it = a.begin();
    std::string::const_iterator b_it = b.begin();

    std::string::const_iterator a_end = a.end();
    std::string::const_iterator b_end = b.end();

    int distance = 0;
    while (a_it != a_end && b_it != b_end)
    {
        if (*a_it != *b_it) ++distance;
        ++a_it; ++b_it;
    }

    return distance;
}

Martin York · Answer 4 · 2009-02-17T16:01:48.897

Choice 1: Modify your original code to be as effecient as possable.

int hd(string const& s1, string const& s2)
{
    // hd stands for "Hamming Distance"
    int dif = 0;

    for (std::string::size_type i = 0; i < s1.size(); i++ )
    {
        char b1 = s1[i];
        char b2 = s2[i];

        dif += (b1 != b2)?1:0;
    }  

    return dif;
}

Second option use some of the STL algorithms to do the heavy lifting.

struct HammingFunc
{
    inline int operator()(char s1,char s2)
    {
        return s1 == s2?0:1;
    }
};

int hd(string const& s1, string const& s2)
{
    int diff = std::inner_product(s1.begin(),s1.end(),
                                  s2.begin(),
                                  0,
                                  std::plus<int>(),HammingFunc()
                                 );
    return diff;
}

score 2 · Answer 5 · edited May 23 '17 at 11:54

2

You use strings.

As explained here The hunt for the fastest Hamming Distance C implementation if you can use char* my experiements conclude that for Gcc 4.7.2 on an Intel Xeon X5650 the fastest general purpose hamming distance calculating function for small strings (char arrays) is:

// na = length of both strings
unsigned int HammingDistance(const char* a, unsigned int na, const char* b) {

    unsigned int num_mismatches = 0;
    while (na) {
        if (*a != *b)
            ++num_mismatches;

        --na;
        ++a;
        ++b;
    }

    return num_mismatches;
}

If your problem allows you to set an upper distance limit, so that you don't care for greater distances and this limit is always less than the strings' length, the above example can be furhterly optimized to:

// na = length of both strings, dist must always be < na
unsigned int HammingDistance(const char* const a, const unsigned int na, const char* const b, const unsigned int dist) {

    unsigned int i = 0, num_mismatches = 0;

    while(i <= dist)
    {
        if (a[i] != b[i])
            ++num_mismatches;

        ++i;
    }

    while(num_mismatches <= dist && i < na)
    {
        if (a[i] != b[i])
            ++num_mismatches;

        ++i;
    }

    return num_mismatches;
}

I am not sure if const does anything regarding speed, but i use it anyways...

edited May 23 '17 at 11:54

Community

1
1

answered Apr 21 '13 at 21:16

pdrak

404
1
4
12

(1) Performance depends on the compiler *and* CPU, among other things. "This is the fastest" is misleading at best, and relies on the code being compiled exactly as your compiler did -- which is not required by any standards. (2) Love how you ignore the fact that the caller has to find lengths. If this code bothered to, its speed would be cut in half. (3) C is not C++. Your "strings" are not C++ strings. This could have been done with C++ strings without sacrificing performance. (4) Seriously? You resurrected a 4 year old question for this? – cHao Apr 22 '13 at 15:13
(1) Gcc 4.7.2 for Intel Xeon X5650. (2-3-4 etc...) I "resurected" this, as you say, because i have already started a new thread which is considered a duplicate of this. This answer serves as a good answer to my original thread that i can not answer, so i answer my thread here. If this answer does not fit here means that my thread is no duplicate of this. Can i throw this answer to my "duplicate" post some other way? – pdrak Apr 22 '13 at 22:40
And something more. The author said that his code was "hopelessly slow". One reason i am writing this is to offer him an alternative that is "get rid of the Strings" (if possible) and use char*. In the above setup difference was huge when we transformed all Strings to char*. It could be a solution for him to do the same. (i did not notice how old the post was) – pdrak Apr 22 '13 at 23:19

score 2 · Answer 6 · answered Feb 17 '09 at 14:54

2

Some obvious points that might make it faster:

Pass the strings as const references, not by value
Use the indexing operator [] to get characters, not a method call
Compile with optimization on

answered Feb 17 '09 at 14:54

unwind

391,730
64
469
606

How do you "compile with optimization on" ? – neversaint Feb 17 '09 at 15:38
Depends very much on the compiler in use, I'm afraid. If you're using GCC for instance, use the -On option, where n is a digit that controls the level of optimization. – unwind Feb 17 '09 at 20:25

Bit Operation For Finding String Difference

6 Answers6

Related