I assume that you are looking up one pattern after the other. I.e., you are doing 1 to 10 million pattern searches at every position in the file!
Consider doing it the other way round. Loop once through your file bytes and determine if the current position is the start of a pattern.
To do this efficiently, I suggest organizing the patterns in an array of list of patterns. Each pattern is stored in a list at array index 256 * byte[0] + byte[1]
.
With 10 million patterns you will have an average of 152 patterns in the lists at each array position. This allows a fast lookup.
You could also use the 3 first bytes (256 * (256 * byte[0] + byte[1]) + byte[2]
) resulting in an array of length 256^3 ~ 16 millions (I worked with longer arrays; no problem for C#). Then you would have less than one pattern per array position in average. This would result in a nearly linear search time O(n)
with respect to the file length. A huge improvement compared to the quadratic O(num_of_patterns * file_length)
for a straight forward algorithm.
We can use a simple byte by byte comparison to compare the patterns, since we can compare starting at a known position. (Boyer Moore is of no use here.)
2 bytes index (patterns must be at least 2 bytes long)
byte[] file = { 23, 36, 43, 76, 125, 56, 34, 234, 12, 3, 5, 76, 8, 0, 6, 125, 234, 56, 211, 122, 22, 4, 7, 89, 76, 64, 12, 3, 5, 76, 8, 0, 6, 125 };
byte[][] patterns = {
new byte[]{ 12, 3, 5, 76, 8, 0, 6, 125, 11 },
new byte[]{ 211, 122, 22, 4 },
new byte[]{ 17, 211, 5, 8 },
new byte[]{ 22, 4, 7, 89, 76, 64 },
};
var patternMatrix = new List<byte[]>[256 * 256];
// Add patterns to matrix.
// We assume pattern.Length >= 2.
foreach (byte[] pattern in patterns) {
int index = 256 * pattern[0] + pattern[1];
patternMatrix[index] ??= new List<byte[]>(); // Ensure we have a list.
patternMatrix[index].Add(pattern);
}
// The search. Loop through the file
for (int fileIndex = 0; fileIndex < file.Length - 1; fileIndex++) { // Length - 1 because we need 2 bytes.
int patternIndex = 256 * file[fileIndex] + file[fileIndex + 1];
List<byte[]> candiatePatterns = patternMatrix[patternIndex];
if (candiatePatterns != null) {
foreach (byte[] candidate in candiatePatterns) {
if (fileIndex + candidate.Length <= file.Length) {
bool found = true;
// We know that the 2 first bytes are matching,
// so let's start at the 3rd
for (int i = 2; i < candidate.Length; i++) {
if (candidate[i] != file[fileIndex + i]) {
found = false;
break;
}
}
if (found) {
Console.WriteLine($"pattern {{{candidate[0]}, {candidate[1]} ..}} found at file index {fileIndex}");
}
}
}
}
}
Same algorithm with 3 bytes (even faster!)
3 bytes index (patterns must be at least 3 bytes long)
var patternMatrix = new List<byte[]>[256 * 256 * 256];
// Add patterns to matrix.
// We assume pattern.Length >= 3.
foreach (byte[] pattern in patterns) {
int index = 256 * 256 * pattern[0] + 256 * pattern[1] + pattern[2];
patternMatrix[index] ??= new List<byte[]>(); // Ensure we have a list.
patternMatrix[index].Add(pattern);
}
// The search. Loop through the file
for (int fileIndex = 0; fileIndex < file.Length - 2; fileIndex++) { // Length - 2 because we need 3 bytes.
int patternIndex = 256 * 256 * file[fileIndex] + 256 * file[fileIndex + 1] + file[fileIndex + 2];
List<byte[]> candiatePatterns = patternMatrix[patternIndex];
if (candiatePatterns != null) {
foreach (byte[] candidate in candiatePatterns) {
if (fileIndex + candidate.Length <= file.Length) {
bool found = true;
// We know that the 3 first bytes are matching,
// so let's start at the 4th
for (int i = 3; i < candidate.Length; i++) {
if (candidate[i] != file[fileIndex + i]) {
found = false;
break;
}
}
if (found) {
Console.WriteLine($"pattern {{{candidate[0]}, {candidate[1]} ..}} found at file index {fileIndex}");
}
}
}
}
}
Why is it faster?
A simple nested loops algorithm compares up to ~ 706 * 106 = 7 * 1014 (700 trillion) patterns! 706 is the length of the file. 106 is the number of patterns.
My algorithm with a 2 bytes index makes ~ 706 * 152 = 1010 pattern comparisons. The number 152 comes from the fact that there are in average 152 patterns for a given 2 bytes index ~ 106/(256 * 256). This is 65,536 times faster.
With 3 bytes you get less than about 706 pattern comparisons. This is more than 10 million times faster. This is the case because we store all the patterns in an array whose length is greater (16 millions) than the number of patterns (10 millions or less). Therefore, at any byte position plus 2 following positions within the file, we can pick up only the patterns starting with the same 3 bytes. And this is in average less than one pattern. Sometimes there may be 0 or 1, sometimes 2 or 3, but rarely more patterns at any array position.
Try it. The shift is from O(n2) to near O(n). The initialization time is O(n). The assumption is that the 2 or 3 first bytes of the patterns are more or less randomly distributed. If this was not the case, my algorithm would degrade to O(n2) in the worst case.
Okay, that's the theory. Since the 3 bytes index version is slower at initialization it may have only an advantage with huge data sets. Other improvements could be made by using Span<byte>
.
See: Big O notation - Wikipedia.