0


I am currently experimenting with a very simple Boyer-Moore variant.
In general my implementation works, but if I try to utilize it in a loop the character pointer containing the haystack gets messed up. And I mean that characters in it are altered, or mixed.
The result is consistent, i.e. running the same test multiple times yields the same screw up.

This is the looping code:

string src("This haystack contains a needle! needless to say that only 2 matches need to be found!");
string pat("needle");
const char* res = src.c_str();

while((res = boyerMoore(res, pat)))
    ++res;

This is my implementation of the string search algorithm (the above code calls a convenience wrapper which pulls the character pointer and length of the string):

unsigned char*
boyerMoore(const unsigned char* src, size_t srcLgth, const unsigned char* pat, size_t patLgth)
{
    if(srcLgth < patLgth || !src || !pat)
        return nullptr;

    size_t skip[UCHAR_MAX]; //this is the skip table
    for(int i = 0; i < UCHAR_MAX; ++i)
        skip[i] = patLgth; //initialize it with default value

    for(size_t i = 0; i < patLgth; ++i)
        skip[(int)pat[i]] = patLgth - i - 1; //set skip value of chars in pattern

    std::cout<<src<<"\n"; //just to see what's going on here!

    size_t srcI = patLgth - 1; //our first character to check
    while(srcI < srcLgth)
    {
        size_t j = 0; //char match ct
        while(j < patLgth)
        {
            if(src[srcI - j] == pat[patLgth - j - 1])
                ++j;
            else
            {
                //since the number of characters to skip may be negative, I just increment in that case

                size_t t = skip[(int)src[srcI - j]];
                if(t > j)
                    srcI = srcI + t - j;
                else
                    ++srcI;
                break;
            }
        }
        if(j == patLgth)
            return (unsigned char*)&src[srcI + 1 - j];
    }
    return nullptr;
}

The loop produced this output (i.e. these are the haystacks the algorithm received):

  1. This haystack contains a needle! needless to say that only 2 matches need to be found!
  2. eedle! needless to say that only 2 matches need to be found!
  3. eedless to say that eed 2 meed to beed to be found!

As you can see the input is completely messed up after the second run. What am I missing? I thought the contents could not be modified, since I'm passing const pointers.
Is the way of setting the pointer in the loop wrong, or is my string search screwing up?

Btw: This is the complete code, except for includes and the main function around the looping code.

EDIT:

The missing nullptr of the first return was due to a copy/paste error, in the source it is actually there.

For clarification, this is my wrapper function:

inline char* boyerMoore(const string &src, const string &pat)
{
    return (const char*) boyerMoore((const unsigned char*) src.c_str(), src.size(),
            (const unsigned char*) pat.c_str(), pat.size());
}
Mäx Müller
  • 233
  • 1
  • 3
  • 15
  • 1
    _"What am I missing?"_ Stepping through your code with a debugger. – πάντα ῥεῖ Sep 01 '15 at 13:28
  • That "off-by-one" is not an error, that is because I'm incrementing the result once and then pass it to the function. The changing of `src` is what bugs me. – Mäx Müller Sep 01 '15 at 13:34
  • You are not changing src, you are chaning res. res is a pointer to a char*, and you incremented it (++res after the while). – kist Sep 01 '15 at 13:36
  • Sorry, meant `res`... – Mäx Müller Sep 01 '15 at 13:38
  • There are several issues that prevent this from even compiling, but once I did compile it, it ran without the #3 error that you have. You're passing 2 parameters to a function that takes 4 arguments, and you're mixing up char* and unsigned char*. Also, the first return of your `boyerMoore()` isn't returning a value. What compiler are you using for this? I was using Visual Studio 2013. – JPhi1618 Sep 01 '15 at 13:51
  • As mentioned, what I call is a convenience wrapper, which does the conversion to `unsigned char*` and provides the length. Though it is strange, that it didn't return a value for you. I'm using `gcc 4.9.2`. – Mäx Müller Sep 01 '15 at 13:54
  • Sorry, I missed the bit about the wrapper. Your function above has a `return;`, and I think that needs to be `return nullptr;` That could be the issue. Also, I guess your wrapper is using something like `strlen(res)` because using the wrong length could also cause problems. – JPhi1618 Sep 01 '15 at 13:56

1 Answers1

1

In your boyerMoore() function, the first return isn't returning a value (you have just return; rather than return nullptr;) GCC doesn't always warn about missing return values, and not returning anything is undefined behavior. That means that when you store the return value in res and call the function again, there's no telling what will print out. You can see a related discussion here.

Also, you have omitted your convenience function that calculates the length of the strings that you are passing in. I would recommend double checking that logic to make sure the sizes are correct - I'm assuming you are using strlen or similar.

Community
  • 1
  • 1
JPhi1618
  • 783
  • 1
  • 11
  • 21
  • Since I'm using `std::string` in this case I'm relying on the `size()` member function. The missing return value was actually a copy/paste error, I edited the question to reflect this. – Mäx Müller Sep 01 '15 at 14:59
  • In your added wrapper, you casting the return value as (const char*), but the function returns char*. Thats not legal. Also, what happened to unsigned char*? I feel like there are details in your code that you are not posting that are causing issues. The incomplete examples aren't helping. – JPhi1618 Sep 01 '15 at 16:09
  • Remove all your wrappers and make all your pointers agree. Stop using `unsigned char*` unless there is a very compelling reason (tip: probably not). The main boyerMoore function seems to work fine, so again I think the devil is in the details _that you aren't showing us_. – JPhi1618 Sep 01 '15 at 16:21