2

As part of a high performance computing course I'm trying to speed up a naive string search in C as much as possible, and I'd like to know if there's anything obvious that could be done that I've missed in my current iteration or if there's a better direction to go in.

Some restrictions are that the pattern must be searched left to right, each character must be checked separately, and it needs to be based on a for loop.

So far I've reduced the time taken for a pattern of 999 As followed by a B in a text of 9,999,999 As followed by a B from ~18 seconds to ~9 seconds, but I'm not sure if it could be faster given the restrictions above.

The current code is:

int match(char** text, int n, char** pattern, int m){
    int i, j, last = n-m;

    for(i = 0; i <= last; i++){
        for(j = 0; j < m; j++){
            if((*text)[i+j] != (*pattern)[j]){
                break;
            }
        }

        if(j == m){
            return i;
        }
    }
    return -1;
}

Where text is the text to be searched, pattern is the pattern to be searched, n is the length of the text, and m is the length of the pattern.

Edit: This is an implementation of the naive string search algorithm. https://en.wikipedia.org/wiki/String_searching_algorithm#Na.C3.AFve_string_search

Edit 2: Code after @RadLexus idea to change from char** to char* (25% faster):

int match(char* text, int n, char* pattern, int m){
    int i, j, last = n-m;

    for(i = 0; i <= last; i++){
        for(j = 0; j < m; j++){
            if(text[i+j] != pattern[j]){
                break;
            }
        }

        if(j == m){
            return i;
        }
    }
    return -1;
}
seashairo
  • 29
  • 5
  • I'm not sure that I understood your task, but you can use `if((*text)[i+j] != (*pattern)[j]){ i = i+j; break; }` to avoid unnecessary iterations – Yara_M May 08 '16 at 00:24
  • Why the restrictions? With those you are missing out on a few pretty good ones. – Jongware May 08 '16 at 00:26
  • 1
    @RadLexus Probably to make sure that we stick to a naive string search instead of using Boyer-Moore or KMP. – seashairo May 08 '16 at 00:29
  • @ziwert I'm sure it's correct - I've tested it on multiple texts and patterns. Which part did you not understand? I added a reference to the naive string search on wikipedia and I can try to explain it better if it's not clear – seashairo May 08 '16 at 00:31
  • 1
    Hm. Was the prototype given to you? I wonder if the double indirection could slow things down, esp. inside the tight loops. For simple strings, a single `char *` for both is enough. – Jongware May 08 '16 at 00:31
  • Have you tried using pointers into the strings instead of indexes? That might be faster because you're not doing the pointer arithmetic every time. Try making your test `if (*p != *q)`. Also pre-calculate the ending values of the pointers in the two loops so you don't have to do any calculations during the loop. – Andy Schweig May 08 '16 at 00:44
  • @RadLexus It wasn't. They provided code to read files and a set of instructions. Changing from char** to char* was a pretty significant speed boost, 25% faster with just a few characters changed... – seashairo May 08 '16 at 00:45
  • I got about a 25% improvement over your latest code using pointers as I mentioned in my earlier comment. – Andy Schweig May 08 '16 at 01:09
  • @AndySchweig I've spent the 25 mins since I saw your comment trying to get pointers to work like you mentioned. I'm not very experienced with C so I think I'm doing it a bit wrong (a lot slower). I have *p and *q being assigned to &text[i] and &pattern[j] inside the i loop, the test as (*p != *q), and then if the test doesn't break incrementing p and q. Any idea what I could be doing wrong? Edit: Actually it is faster, I missed something. – seashairo May 08 '16 at 01:16
  • Use pointer variables in the for loops instead of indexes. That way you eliminate the indexing step. – Andy Schweig May 08 '16 at 01:33
  • Another hint: my version of `match` has no `int`s in it (other than `n` and `m`). – Andy Schweig May 08 '16 at 01:41
  • 1
    Declare `last` as const. `const int last = n-m;` Could be a hint to the compiler to keep `last` in a register on each iteration through the loop. – selbie May 08 '16 at 02:42
  • I suggest changing int to size_t whenever declaring anything that's used as a size or index for an array, as proceasing the sign might incur a slight overhead. Might also be wise to develop a thorough testcase and use valgrind as it seems like there might be situations when access out of bounds is possible, especially if you use memcmp as suggested below in an answer. – autistic May 08 '16 at 02:57
  • 1
    is the question about the patern AAAA...B or about arbitrary patterns ? –  May 08 '16 at 15:37
  • Are algorithmic improvements allowed? Like checking the last character of the pattern for a match, after finding a candidate first character? That would avoid the pathological worst-case you mention in the question. But I think it sounds like "checking the pattern from left to right" rules that out. What about noticing if the pattern starts with a repeated prefix character? You might take advantage of that without violating the left-to-right rule. – Peter Cordes May 09 '16 at 08:43

2 Answers2

4

I got a 10x speedup simply by doing this:

for (int i = 0; i <= last; i++) {
    if (memcmp(&text[i], pattern, m) == 0) {
        return i;
    }
}

Now you may complain that this no longer uses "for" loops and therefore isn't a solution. But memcmp uses loops and you can lift its implementation into your code if you like. As for why it's so much faster, we have a good answer here: Why is memcmp so much faster than a for loop check?

Community
  • 1
  • 1
John Zwinck
  • 239,568
  • 38
  • 324
  • 436
  • 1
    This most likely violates the requirement "each character must be checked separately" (there is a significant gain in *not* doing that). It probably violates the *intention* of the question as well. Why not use `strstr`? – Jongware May 08 '16 at 09:53
  • @RadLexus: I addressed that directly in my answer. Yes, `memcmp()` may be outlawed. But you can simply copy its implementation tricks into your code if desired. There's no reason for us to all reinvent this wheel one opcode at a time. – John Zwinck May 08 '16 at 14:28
  • 1
    @RadLexus: it is probable that memcmp is vectorized, hence ruled out (even rewritten). –  May 08 '16 at 14:43
-1

To use only loops, removing the array indexing in favour of pointers will probably provide another speed-up:

int match(char* text, int n, char* pattern, int m){
    char *ptext= text, *pend= text+n, *ptext2;
    char *ppat, *ppatend;

    if (n<m) return -1;

    for (; ptext<pend; ptext++) {
        ptext2= ptext; ppat= pattern; ppatend= ppat+m;
        for (; ppat<ppatend; ppat++, ptext2++) {
            if (*ptext2 != *ppat) {
                break;
            }
        }
        if (ppat==ppatend) {
            return (ptext-text);
        }
    }
    return -1;
}
Paul Ogilvie
  • 25,048
  • 4
  • 23
  • 41
  • 2
    A good optimizer will choose the fastest implementation, whether accesses are done by indexing or pointer dereference. I bet there will be no measurable difference. –  May 08 '16 at 15:14
  • @Yves-daoust, 1) you have to inspect the assembler to learn whether you have a "good" optimizer and I doubt it will turn array-indexing into pointers; 2) his execercise is to optimize his search so he should learn/know about replacing indexing with pointers; he shoud not learn to rely on an optimizer that may not always be available or be "good". – Paul Ogilvie May 08 '16 at 16:33
  • I agree with @YvesDaoust: unless you actually know what asm you want, this is just a bad idea. Your version [has an inner loop that's 5 uops instead of 6, on Intel Haswell, with gcc 5.3](https://godbolt.org/g/05H6hY), since indexed addressing modes can't micro-fuse on SnB-family microarchitectures. What would be more interesting is to count up towards zero with a negative array index, to potentially remove the `cmp` instruction from the loop overhead. (So `inc/jcc` could micro-fuse.) It might be possible to get the inner loop down to 4 fused-domain uops, speeding it up to one per clock. – Peter Cordes May 09 '16 at 09:06
  • A more standard way to do that is to unroll a bit. Even gcc `-funroll-loops` probably helps a bit, but source-level unrolling might help gcc do a better job. – Peter Cordes May 09 '16 at 09:08