
I have a task where I need to search for sequences in a file. In a test application, I read the file as a string (File.ReadAllText) and used string.IndexOf to look for a sequence. When I implemented the same algorithm with bytes (reading the file as a byte array and searching for a byte array within a byte array), I noticed that searching for byte[] in byte[] is about three times as slow as searching for string in string. I checked it thoroughly: with exactly the same code, one version using byte[] and the other using string, the byte version takes three times as long to execute - about 16 s for bytes vs ~5 s for strings.

For looking up byte arrays, I used the approaches described here: byte[] array pattern search. For looking up strings, I used the built-in IndexOf method of the string class. Here is one of the byte[] IndexOf implementations I tried:

    public int IndexOf(byte[] source, byte[] pattern, int startpos = 0)
    {
        // The last valid start position is source.Length - pattern.Length (inclusive),
        // so the loop condition must be <=, not <, or a match at the very end is missed.
        int search_limit = source.Length - pattern.Length;
        for (int i = startpos; i <= search_limit; i++)
        {
            // Only scan the rest of the pattern when the first byte matches.
            if (source[i] == pattern[0])
            {
                bool found = true;
                for (int j = 1; j < pattern.Length; j++)
                {
                    if (source[i + j] != pattern[j])
                    {
                        found = false;
                        break;
                    }
                }
                if (found)
                    return i;
            }
        }
        return -1;
    }

Basically, looking up the next match for a sequence of bytes in a byte array takes three times as long as looking up the next match for a sequence of chars in a string. My question is - WHY?

Does anyone know how .Net handles searching for chars in a string - what kind of optimisation it does, what algorithm it uses? Is there a faster algorithm than the one I'm using here? Maybe someone has an idea what I'm doing wrong that makes it take more time than it should? I really do not understand how searching for a string in a string can be three times as fast as searching for a byte[] in a byte[]...

UPDATE: I have tried the unsafe algorithm as suggested. It was as follows:

    public static unsafe long IndexOfFast(byte[] Haystack, byte[] Needle, long startpos = 0)
    {
        long i = startpos;
        fixed (byte* H = Haystack) fixed (byte* N = Needle)
        {
            // Stop early enough that the inner loop never reads past the haystack.
            for (byte* hNext = H + startpos, hEnd = H + Haystack.LongLength - Needle.LongLength + 1; hNext < hEnd; i++, hNext++)
            {
                bool Found = true;
                for (byte* hInc = hNext, nInc = N, nEnd = N + Needle.LongLength; Found && nInc < nEnd; Found = *nInc == *hInc, nInc++, hInc++) ;
                if (Found) return i;
            }
            return -1;
        }
    }

The strange thing is, it actually proved to be twice as slow! I changed it to add my personal tweak (checking the first byte before attempting to iterate through the needle), and it looks like this now:

    public static unsafe long IndexOfFast(byte[] Haystack, byte[] Needle, long startpos = 0)
    {
        long i = startpos;
        fixed (byte* H = Haystack) fixed (byte* N = Needle)
        {
            // Stop early enough that the inner loop never reads past the haystack.
            for (byte* hNext = H + startpos, hEnd = H + Haystack.LongLength - Needle.LongLength + 1; hNext < hEnd; i++, hNext++)
            {
                // Check the first byte before iterating through the needle.
                if (*hNext == *N)
                {
                    bool Found = true;
                    for (byte* hInc = hNext + 1, nInc = N + 1, nEnd = N + Needle.LongLength; Found && nInc < nEnd; Found = *nInc == *hInc, nInc++, hInc++) ;
                    if (Found) return i;
                }
            }
            return -1;
        }
    }

Now it takes exactly the same time to execute as the safe one. My question is again - any ideas why? Shouldn't it be faster than the safe version, since it's unsafe and operates on pointers?

Istrebitel
  • I would implement [this algorithm](http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm) for byte arrays and test again. – I4V May 31 '13 at 13:06
  • Well, most string comparison functions and `IndexOf` boil down to an internal CLR call, usually either `InternalFindNLSStringEx` or `InternalCompareString`. A native CLR implementation is probably going to be faster. – vcsjones May 31 '13 at 13:06
  • OK, so now that you have a pointer-arithmetic solution if you want to make it even faster you're going to have to start thinking about what machine operations are actually being done there. For example: what is faster: shifting a byte out of a 32 bit word four times, or making four masks, one with your byte in each of four positions, treating the byte array as an array of int, and checking each int to see if it matches any of your masks when ANDed with the mask? The latter might be a few cycles faster. That's the kind of stuff that people do when optimizing this algorithm. – Eric Lippert Jun 03 '13 at 21:19
  • Remember, the processor *loves to do things that are the processor word size*. Anything you can do to keep operations in that word size is likely to be rewarded. – Eric Lippert Jun 03 '13 at 21:20

2 Answers


> Basically, looking up the next match for a sequence of bytes in a byte array takes three times as long as looking up the next match for a sequence of chars in a string. My question is - WHY?

Because the string search algorithm has been heavily optimized; it is written in tight unmanaged code which does not spend time checking array bounds. If you were to optimize your byte search algorithm in the same way, it would be just as fast; the string search algorithm uses the same naive algorithm that you are using.

Your algorithm is fine -- this is the standard "naive" search, and Kevin's claims notwithstanding, the naive algorithm is in practice almost always the fastest on typical data. Blazing through an array looking for a byte is incredibly fast on modern hardware. It depends on the size of your problem; if you're looking for a long DNA string in the middle of the human genome then Boyer-Moore is totally worth the expense of preprocessing. If you're looking for 0xDEADBEEF in a twenty KB file then you're not going to beat the naive algorithm if it is efficiently implemented.

For maximum speed what you should do here is turn off the safety system and write the code using unsafe pointer arithmetic.

Eric Lippert
  • Thought you would be interested in the timing numbers I've added. It certainly validates that naive search is much better for "typical data", i.e. searching through a small amount of data. – Kevin Jun 03 '13 at 20:55

Your search algorithm for bytes is extremely inefficient!

The baseline algorithm against which all other string searches are compared is Boyer-Moore. I would bet .NET string searches use it or a variation of it. There are other algorithms as well, but implementing Boyer-Moore for bytes will give you much better performance.

Edit: SO to the rescue! Here is a simple C# implementation of Boyer-Moore for byte arrays
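For a feel of how the skip table works, here is a minimal Boyer-Moore-Horspool sketch (a simplified Boyer-Moore variant that uses only the bad-character rule; this is an illustration written for this answer, not the linked implementation):

```csharp
using System;

static class ByteSearch
{
    // Boyer-Moore-Horspool: precompute, for every byte value, how far the
    // search window may shift when the haystack byte aligned with the end
    // of the needle mismatches. Longer needles allow bigger skips.
    public static int IndexOfHorspool(byte[] haystack, byte[] needle, int startpos = 0)
    {
        if (needle.Length == 0) return startpos;

        var shift = new int[256];
        for (int i = 0; i < 256; i++) shift[i] = needle.Length;
        for (int i = 0; i < needle.Length - 1; i++)
            shift[needle[i]] = needle.Length - 1 - i;

        int pos = startpos;
        while (pos <= haystack.Length - needle.Length)
        {
            // Compare right to left, as Boyer-Moore does.
            int j = needle.Length - 1;
            while (j >= 0 && haystack[pos + j] == needle[j]) j--;
            if (j < 0) return pos;
            pos += shift[haystack[pos + needle.Length - 1]];
        }
        return -1;
    }
}
```

Note the preprocessing cost (filling a 256-entry table per search): that fixed overhead is exactly what makes the naive algorithm win on short searches.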

Edit with timing numbers: Eric's comments got me very interested, because I've always heard that .Net string searches use Boyer-Moore, and I really appreciated someone who actually knew telling me otherwise. It makes perfect sense after thinking about it. I decided to time Boyer-Moore against the naive byte search and found out Eric is absolutely correct: for a small needle and a small haystack, the naive search is over 13x faster. What surprised me, though, was that the break-even point was much lower than I expected. Boyer-Moore improves significantly with pattern size (needle size in my timings), so the larger the pattern you are looking for, the faster it searches. For an 8-byte needle, naive search and Boyer-Moore were tied searching through a 500-600 byte haystack. For a larger haystack, Boyer-Moore gets significantly better, especially with a larger needle: for a 20KB haystack and a 64-byte needle, Boyer-Moore was 10x faster.

Full timing numbers are below for anyone interested.

All tests used the simple Boyer-Moore implementation linked above and the OP's naive byte search algorithm as posted, running 1M search iterations each.

1000000 iterations, haystack size = 20 bytes, needle size = 8 bytes
20ms total : Naive Search
268ms total : Boyer-Moore Search

1000000 iterations, haystack size = 600 bytes, needle size = 8 bytes
608ms total : Naive Search
582ms total : Boyer-Moore Search

1000000 iterations, haystack size = 2000 bytes, needle size = 8 bytes
2011ms total : Naive Search
1339ms total : Boyer-Moore Search

1000000 iterations, haystack size = 2000 bytes, needle size = 32 bytes
1865ms total : Naive Search
563ms total : Boyer-Moore Search

1000000 iterations, haystack size = 2000 bytes, needle size = 64 bytes
1883ms total : Naive Search
466ms total : Boyer-Moore Search

1000000 iterations, haystack size = 20000 bytes, needle size = 8 bytes
18899ms total : Naive Search
10753ms total : Boyer-Moore Search

1000000 iterations, haystack size = 20000 bytes, needle size = 32 bytes
18639ms total : Naive Search
3114ms total : Boyer-Moore Search

1000000 iterations, haystack size = 20000 bytes, needle size = 64 bytes
18866ms total : Naive Search
1807ms total : Boyer-Moore Search
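For anyone who wants to reproduce numbers like these, a harness along the following lines works (a hypothetical sketch, not my actual test code; `search` is whichever implementation you are measuring):

```csharp
using System;
using System.Diagnostics;

static class SearchTimer
{
    // Times `iterations` repeated searches over the same haystack/needle
    // and returns the elapsed wall-clock time in milliseconds.
    public static long Time(Func<byte[], byte[], int> search,
                            byte[] haystack, byte[] needle, int iterations)
    {
        search(haystack, needle);          // warm-up call so JIT cost isn't timed
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            search(haystack, needle);
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }
}
```

Run it once per (algorithm, haystack size, needle size) combination to fill out a table like the one above.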
Kevin
  • Extremely? At the link I provided it's one of the fastest. Can you elaborate on what I'm doing wrong with my algorithm? – Istrebitel May 31 '13 at 14:06
  • You don't need to check every byte for the start of the pattern. With some preprocessing of your pattern bytes you are able to skip forward in the search array. Boyer-Moore improves in efficiency with longer patterns specifically because you can skip more of the search array. – Kevin May 31 '13 at 14:27
  • Hmm, I see now. Boyer-Moore is an old algorithm; shouldn't there be a well-known C# implementation for byte[]? I could write it myself, but I don't want to reinvent the wheel. – Istrebitel May 31 '13 at 14:34
  • This answer is actually completely wrong. One of the first questions I asked of the famous Tim Paterson when I started at Microsoft as an intern was why the standard libraries don't use Boyer-Moore or another non-naive algorithm. Tim's wise answer was that for the vast, vast majority of string search problems executed in real line-of-business code, the naive algorithm finds the answer in less time than it takes to *preprocess* the string in Boyer-Moore. Most LOB programmers are looking for "London" in an address, not solving human genome DNA search problems. – Eric Lippert May 31 '13 at 16:28
  • @EricLippert I stand corrected. However, hopefully the OP will report back on the performance for his scenario. I assumed, based on his initial numbers, that .IndexOf is not doing a simple character-by-character search. – Kevin May 31 '13 at 16:35
  • IndexOf does a naive character-by-character search; it just does it in tightly optimized C code that doesn't do boundary checks on every array access. – Eric Lippert May 31 '13 at 16:42
  • I guess I will check whether it is "safe" to use strings to work with files (meaning, will IndexOf correctly handle non-displayable or control characters in a string, and so on) and will just use the IndexOf method of string. If I can't use strings, then I'll indeed post back which method works faster - naive or non-naive. – Istrebitel Jun 03 '13 at 07:48
  • @Istrebitel If it's actual binary data, I don't think you want to use string comparison, because you'll have to use an encoding to convert the data from binary to string. If you don't use the same encoding that was used to write the string to bytes, or if the data is actual binary data (it didn't start life as a string), then you are going to occasionally "lose" data, since it's not valid encoded bytes. – Kevin Jun 03 '13 at 15:57
  • Interesting data! Thanks for posting this. – Eric Lippert Jun 03 '13 at 21:16
  • Thank you very much for your thorough research, I will try Boyer-Moore algorithm myself and reply. – Istrebitel Jun 04 '13 at 07:48
  • I have tried the Boyer-Moore algorithm and it gave me very nice results. It's even a bit faster than the .Net IndexOf function. Strange thing - the algorithm described at Wikipedia includes more features (a suffix check, and it checks for the stop symbol every time it fails, not only when it fails on the 1st symbol), but all those implementations proved slower than this implementation, which only checks for the stop symbol on the first try (the last symbol of the needle). – Istrebitel Jun 05 '13 at 08:40
  • @Istrebitel Excellent. Glad I could help. As the timing numbers show though it's completely dependent on your use case and how big your haystack and needle are. – Kevin Jun 05 '13 at 14:21
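Kevin's warning about encodings in the comments above is easy to demonstrate: round-tripping arbitrary binary data through a UTF-8 string silently replaces invalid byte sequences with U+FFFD, so the bytes you get back are not the bytes you started with. A minimal sketch:

```csharp
using System;
using System.Text;

static class EncodingLossDemo
{
    // Returns true only if decoding the bytes to a string and re-encoding
    // reproduces the original bytes exactly. Fails for bytes that are not
    // valid UTF-8, because the decoder substitutes U+FFFD for them.
    public static bool RoundTripsLosslessly(byte[] data)
    {
        string asText = Encoding.UTF8.GetString(data);
        byte[] back = Encoding.UTF8.GetBytes(asText);
        if (back.Length != data.Length) return false;
        for (int i = 0; i < data.Length; i++)
            if (back[i] != data[i]) return false;
        return true;
    }
}
```

For example, `RoundTripsLosslessly(new byte[] { 0xC3, 0x28, 0xFF })` returns false, because 0xC3 0x28 and 0xFF are not valid UTF-8 sequences, while ASCII text such as the bytes of "London" survives the round trip.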