I have a task where I need to look up for sequences in a file. When doing a test application, I have read file as string (File.ReadAllText) and used string.IndexOf to look up for a sequence. When I tried to implement the same algorithm with bytes (reading file as byte array and looking up byte array in byte array), I noticed that looking for byte[] in byte[] is about 3 times as slow as looking for string in string. I made sure to check it throughly, and exactly the same code, one using byte[] and other using string, take 3 times as much to execute - like, 16s for byte vs ~5s for string.
For looking up byte arrays, I used ways described here byte[] array pattern search. For looking up strings, I used built-in IndexOf function of string class. Here is one of the implementations of IndexOf for byte[] I tried:
public int IndexOf(byte[] source, byte[] pattern, int startpos = 0)
{
int search_limit = source.Length - pattern.Length;
for (int i = startpos; i < search_limit; i++)
{
if (source[i] == pattern[0])
{
bool found = true;
for (int j = 1; j < pattern.Length; j++)
{
if (source[i + j] != pattern[j])
{
found = false;
break;
}
}
if (found)
return i;
}
}
return -1;
}
Basically, looking up next match for sequence of bytes in bytes array takes three time as long as looking up next match for sequence of chars in string. My question is - WHY?
Does anyone know how .Net handles looking up chars in string, what kind of optimisation it does, what algorithm it uses? Is there a faster algorithm than the one I'm using here? Maybe someone has an idea what I'm doing wrong here so that it takes more time than it should? I really do not understand how looking up string in string can be 3 times as fast as looking up byte[] in byte[]...
UPDATE: I have tried the unsafe algorithm as suggested. It was as follows:
public static unsafe long IndexOfFast(byte[] Haystack, byte[] Needle, long startpos = 0)
{
long i = startpos;
fixed (byte* H = Haystack) fixed (byte* N = Needle)
{
for (byte* hNext = H + startpos, hEnd = H + Haystack.LongLength; hNext < hEnd; i++, hNext++)
{
bool Found = true;
for (byte* hInc = hNext, nInc = N, nEnd = N + Needle.LongLength; Found && nInc < nEnd; Found = *nInc == *hInc, nInc++, hInc++) ;
if (Found) return i;
}
return -1;
}
}
}
strange thing is, it actually proved to be twice as slow! I changed it to add my personal tweak (checking first letter before attempting to iterate through needle) and it looks like this now:
public static unsafe long IndexOfFast(byte[] Haystack, byte[] Needle, long startpos = 0)
{
long i = startpos;
fixed (byte* H = Haystack) fixed (byte* N = Needle)
{
for (byte* hNext = H + startpos, hEnd = H + Haystack.LongLength; hNext < hEnd; i++, hNext++)
{
if (*hNext == *N)
{
bool Found = true;
for (byte* hInc = hNext+1, nInc = N+1, nEnd = N + Needle.LongLength; Found && nInc < nEnd; Found = *nInc == *hInc, nInc++, hInc++) ;
if (Found) return i;
}
}
return -1;
}
}
Now, it takes exactly the same time to execute as the safe one. My question is again - any ideas why? Shouldn't it be faster because it's unsafe and operates with pointers, when compared to safe?