6

I have a large binary file that is around 70 MB in size. In my program, I have a method that looks up byte[] patterns in the file, to check whether they exist in it or not. I have around 1-10 million patterns to run against the file. The options I see are the following:

  1. Read the file into memory by doing byte[] file = File.ReadAllBytes(path), then perform a byte[] lookup of the byte[] pattern(s) against the file bytes. I have used multiple methods for doing that from different topics on SO, such as "byte[] array pattern search", "Find an array (byte[]) inside another array?" and "Best way to check if a byte[] is contained in another byte[]". However, byte[]-in-byte[] lookups are extremely slow when the source is large. It would take weeks to run 1 million patterns on a normal computer.
  2. Convert both the file and the patterns into hex strings, then do the comparisons with the Contains() method to perform the lookup. This is faster than the byte[] lookups, but converting bytes to hex makes the file much larger in memory, which adds processing time.
  3. Convert both the file and the patterns into strings using Encoding.GetEncoding(1252).GetString() and perform the lookups. Then compensate for the limitations of the binary-to-string conversion (I know they are incompatible) by re-checking the Contains() matches with another method that performs byte[] lookups (the first option). This is the fastest option for me (a rough sketch follows below).
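
A rough sketch of the third approach (the helper and variable names are illustrative, not my exact code):

byte[] fileBytes = File.ReadAllBytes(path);
Encoding enc = Encoding.GetEncoding(1252);
string fileText = enc.GetString(fileBytes);

foreach (byte[] pattern in patterns)
{
    string patternText = enc.GetString(pattern);
    // Fast ordinal string search first.
    if (fileText.IndexOf(patternText, StringComparison.Ordinal) >= 0)
    {
        // Re-check the match with a real byte[] lookup (any sub-array search from option 1).
        if (ContainsSubArray(fileBytes, pattern)) // hypothetical helper
        {
            // pattern exists in the file
        }
    }
}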

Using the third approach, which is the fastest, 1 million patterns take from two thirds of a day to a full day, depending on the CPU. I need information on how to speed up the lookups.

Thank you.

Edit: Thanks to @MySkullCaveIsADarkPlace I now have a fourth approach which is faster than the three approaches above. I was using inefficient byte[] lookup algorithms; now I am using the MemoryExtensions.IndexOf() byte[] lookup method. However, even though this method is faster, the lookups are still slow. It takes 1 minute for 1000 pattern lookups.
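
For reference, the fourth approach looks roughly like this (sketch; `file` and `pattern` are byte arrays as described above):

byte[] file = File.ReadAllBytes(path);
foreach (byte[] pattern in patterns)
{
    // MemoryExtensions.IndexOf performs the sub-span search (SIMD-accelerated).
    if (file.AsSpan().IndexOf(pattern) != -1)
    {
        // pattern found somewhere in the file
    }
}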

The patterns are 12-20 bytes each.

Theodor Zoulias
Tiano Manti
  • 1
    "_This one is faster than byte[] lookups_" That shouldn't be, unless the byte array lookup is implemented in an inefficient manner. Basically, an ordinal sub-string lookup in a string should perform more or less the same as a sub-array lookup in a byte[], if both are implemented equally well. So, what exactly are you using to do the byte[] lookups? Are you using https://learn.microsoft.com/en-us/dotnet/api/system.memoryextensions.indexof#system-memoryextensions-indexof-1(system-readonlyspan((-0))-system-readonlyspan((-0))) ? –  Sep 25 '22 at 13:48
  • 1
    (FYI: Considering that a char in .NET is 16 bits, you should see very similar performance figures when comparing an ordinal search of a substring in a string compared to searching a sub-byte-array with a length equal to 2 * substring.Length in a byte[] with length 2 * string.Length when done correctly and where the first match is located at the same byte position/memory address relative to the searched space.) –  Sep 25 '22 at 14:07
  • What is the average length of the patterns? – Theodor Zoulias Sep 25 '22 at 15:33
  • @MySkullCaveIsADarkPlace Thanks for the information. I have looked into the page you mentioned and now I am using `if (file.AsSpan().IndexOf(pattern) != -1)` where `file` is my file bytes and `pattern` is the pattern bytes I am looking for in the file bytes. I am not sure if that is the right usage for it, but that line actually provides a slightly faster speed than all 3 approaches above. Please verify if my usage is correct and I will edit my question. Note that even with this fourth approach, lookups are still slow. Takes around 1 minute for 1000 patterns. Regards. – Tiano Manti Sep 25 '22 at 15:34
  • 1
    @TheodorZoulias They are 12-20 bytes each. – Tiano Manti Sep 25 '22 at 15:35
  • 1
    Your usage of Span.IndexOf looks alright. And it's as fast as it can get (it's using SIMD instructions, btw), unless the byte arrays you are searching in and searching for follow some pattern or structure that could be exploited by crafting a bespoke search algorithm relying on that pattern/structure... –  Sep 25 '22 at 15:38
  • If the search patterns are small in comparison to the size of the search space, you might perhaps benefit from parallelizing the search. Basically, you divide your search space into logical regions that can be searched in parallel (and can be aborted early once a match is found). Each of these regions needs to overlap its neighbouring regions by pattern length - 1. –  Sep 25 '22 at 15:40
  • Aha, they are pretty small. So the longest pattern is 20 bytes? Is this the hard limit? – Theodor Zoulias Sep 25 '22 at 15:41
  • 1
    @TheodorZoulias I am not sure what you mean by hard limit but if you mean maximum pattern size possible then yes. Maximum is 20 bytes. – Tiano Manti Sep 25 '22 at 15:48
  • @MySkullCaveIsADarkPlace I am not sure how to do that, since this is the first time I have heard of it. Are there any SO posts or topics I can refer to, to understand what you are saying? It would be nice if you could provide an example as an answer. If not, thanks anyway. – Tiano Manti Sep 25 '22 at 15:49
  • You would basically use [`Parallel.ForEach`](https://learn.microsoft.com/en-us/dotnet/api/system.threading.tasks.parallel.foreach). First you would calculate the start offset and end offset of the overlapping regions (the number of regions should correlate with the number of available cores), and collect them in some collection. The items in this collection (the region info) are then passed by Parallel.ForEach to the job function/delegate. Inside the job function, a span is acquired from the source byte array using the passed region information... –  Sep 25 '22 at 15:54
  • Correction to my last comment: Since it is not possible to abort Span.IndexOf early, the number and therefore size of the regions should _not_ depend on the number of cores (as this would probably result in large regions and long-ish running jobs); instead, regions should be small enough that each job does not run for too long. But they should not be too small either, so as not to create too many parallel jobs/threads, as the overhead of creating/scheduling/running very many jobs in parallel might very well eliminate most if not all of the possible perf benefit. –  Sep 25 '22 at 16:01
  • (If peak performance matters a lot, you should benchmark this with a variety of datasets on every CPU model your code is supposed to run on, to find the optimal region size and therefore job/thread count for a given CPU model. A rough sketch of this regioned search follows after the comments.) –  Sep 25 '22 at 16:02
  • Are you trying to build your own virus scanner? – CodeCaster Sep 25 '22 at 16:33
  • @CodeCaster No. – Tiano Manti Sep 26 '22 at 16:58
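
A rough sketch of the overlapping-region parallel search suggested in the comments above (the region size, method name and early-exit handling are illustrative assumptions, not code from the thread):

// Split the file into regions that overlap by (pattern.Length - 1) bytes,
// search the regions in parallel, and stop scheduling new regions once a match is found.
bool ContainsPatternParallel(byte[] file, byte[] pattern, int regionSize = 1_000_000)
{
    int overlap = pattern.Length - 1;
    var regions = new List<(int Start, int Length)>();
    for (int start = 0; start < file.Length; start += regionSize)
    {
        int length = Math.Min(regionSize + overlap, file.Length - start);
        regions.Add((start, length));
    }

    bool found = false;
    Parallel.ForEach(regions, (region, state) =>
    {
        // Each job searches its own slice of the file.
        if (file.AsSpan(region.Start, region.Length).IndexOf(pattern) >= 0)
        {
            found = true; // benign race: the flag is only ever set to true
            state.Stop();
        }
    });
    return found;
}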

2 Answers

6

I assume that you are looking up one pattern after the other. I.e., you are doing 1 to 10 million pattern searches at every position in the file!

Consider doing it the other way round. Loop once through your file bytes and determine if the current position is the start of a pattern.

To do this efficiently, I suggest organizing the patterns in an array of lists of patterns. Each pattern is stored in a list at array index 256 * pattern[0] + pattern[1]. With 10 million patterns you will have on average 152 patterns in the list at each array position. This allows a fast lookup.

You could also use the first 3 bytes (256 * (256 * pattern[0] + pattern[1]) + pattern[2]), resulting in an array of length 256^3 ≈ 16 million (I worked with longer arrays; no problem for C#). Then you would have less than one pattern per array position on average. This would result in a nearly linear search time O(n) with respect to the file length. A huge improvement compared to the quadratic O(num_of_patterns * file_length) of a straightforward algorithm.

We can use a simple byte-by-byte comparison to compare the patterns, since we can start comparing at a known position. (Boyer-Moore is of no use here.)

2-byte index (patterns must be at least 2 bytes long)

byte[] file = { 23, 36, 43, 76, 125, 56, 34, 234, 12, 3, 5, 76, 8, 0, 6, 125, 234, 56, 211, 122, 22, 4, 7, 89, 76, 64, 12, 3, 5, 76, 8, 0, 6, 125 };
byte[][] patterns = {
    new byte[]{ 12, 3, 5, 76, 8, 0, 6, 125, 11 },
    new byte[]{ 211, 122, 22, 4 },
    new byte[]{ 17, 211, 5, 8 },
    new byte[]{ 22, 4, 7, 89, 76, 64 },
};
var patternMatrix = new List<byte[]>[256 * 256];

// Add patterns to matrix.
// We assume pattern.Length >= 2.
foreach (byte[] pattern in patterns) {
    int index = 256 * pattern[0] + pattern[1];
    patternMatrix[index] ??= new List<byte[]>(); // Ensure we have a list.
    patternMatrix[index].Add(pattern);
}

// The search. Loop through the file
for (int fileIndex = 0; fileIndex < file.Length - 1; fileIndex++) { // Length - 1 because we need 2 bytes.
    int patternIndex = 256 * file[fileIndex] + file[fileIndex + 1];
    List<byte[]> candidatePatterns = patternMatrix[patternIndex];
    if (candidatePatterns != null) {
        foreach (byte[] candidate in candidatePatterns) {
            if (fileIndex + candidate.Length <= file.Length) {
                bool found = true;
                // We know that the 2 first bytes are matching,
                // so let's start at the 3rd
                for (int i = 2; i < candidate.Length; i++) {
                    if (candidate[i] != file[fileIndex + i]) {                             
                        found = false;
                        break;
                    }
                }
                if (found) {
                    Console.WriteLine($"pattern {{{candidate[0]}, {candidate[1]} ..}} found at file index {fileIndex}");
                }
            }
        }
    }
}

Same algorithm with 3 bytes (even faster!)

3-byte index (patterns must be at least 3 bytes long)

var patternMatrix = new List<byte[]>[256 * 256 * 256];

// Add patterns to matrix.
// We assume pattern.Length >= 3.
foreach (byte[] pattern in patterns) {
    int index = 256 * 256 * pattern[0] + 256 * pattern[1] + pattern[2];
    patternMatrix[index] ??= new List<byte[]>(); // Ensure we have a list.
    patternMatrix[index].Add(pattern);
}

// The search. Loop through the file
for (int fileIndex = 0; fileIndex < file.Length - 2; fileIndex++) { // Length - 2 because we need 3 bytes.
    int patternIndex = 256 * 256 * file[fileIndex] + 256 * file[fileIndex + 1] + file[fileIndex + 2];
    List<byte[]> candidatePatterns = patternMatrix[patternIndex];
    if (candidatePatterns != null) {
        foreach (byte[] candidate in candidatePatterns) {
            if (fileIndex + candidate.Length <= file.Length) {
                bool found = true;
                // We know that the 3 first bytes are matching,
                // so let's start at the 4th
                for (int i = 3; i < candidate.Length; i++) {
                    if (candidate[i] != file[fileIndex + i]) {
                        found = false;
                        break;
                    }
                }
                if (found) {
                    Console.WriteLine($"pattern {{{candidate[0]}, {candidate[1]} ..}} found at file index {fileIndex}");
                }
            }
        }
    }
}

Why is it faster?

A simple nested-loops algorithm makes up to ~ 70 * 10^6 * 10^7 = 7 * 10^14 (700 trillion) pattern comparisons! 70 * 10^6 is the length of the file, 10^7 is the number of patterns.

My algorithm with a 2-byte index makes ~ 70 * 10^6 * 152 ≈ 10^10 pattern comparisons. The number 152 comes from the fact that there are on average 152 patterns for a given 2-byte index (~ 10^7 / (256 * 256)). This is 65,536 times faster.

With 3 bytes you get less than about 70 * 10^6 pattern comparisons. This is more than 10 million times faster. This is the case because we store all the patterns in an array whose length (256^3 ≈ 16 million) is greater than the number of patterns (10 million or less). Therefore, at any byte position (plus the 2 following positions) within the file, we can pick up only the patterns starting with the same 3 bytes, and that is on average less than one pattern. Sometimes there may be 0, 1, 2 or 3, but rarely more patterns at any array position.

Try it. The shift is from O(n^2) to nearly O(n). The initialization time is O(n). The assumption is that the first 2 or 3 bytes of the patterns are more or less randomly distributed. If this were not the case, my algorithm would degrade to O(n^2) in the worst case.

Okay, that's the theory. Since the 3-byte index version is slower at initialization, it may only have an advantage with huge data sets. Other improvements could be made by using Span<byte>; a rough sketch follows.
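
For example, the inner comparison could look roughly like this with spans (a sketch, reusing the patternMatrix and file from the 2-byte version above):

ReadOnlySpan<byte> fileSpan = file;
for (int fileIndex = 0; fileIndex < fileSpan.Length - 1; fileIndex++)
{
    int patternIndex = 256 * fileSpan[fileIndex] + fileSpan[fileIndex + 1];
    List<byte[]> candidatePatterns = patternMatrix[patternIndex];
    if (candidatePatterns == null) continue;
    foreach (byte[] candidate in candidatePatterns)
    {
        // Compare the remainder of the candidate in one call instead of a manual loop.
        if (fileIndex + candidate.Length <= fileSpan.Length &&
            fileSpan.Slice(fileIndex + 2, candidate.Length - 2).SequenceEqual(candidate.AsSpan(2)))
        {
            Console.WriteLine($"pattern {{{candidate[0]}, {candidate[1]} ..}} found at file index {fileIndex}");
        }
    }
}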

See: Big O notation - Wikipedia.

Olivier Jacot-Descombes
  • Thank you for the reply. Though, I feel like I am doing something wrong. I have tested the algorithm with a sample of 200k patterns and it finished in around 1 minute only. Note that the logic behind the algorithm is beyond me. Here is how I used your algorithm. Please let me know if I have messed up or something. https://codeshare.io/VZqvg8 – Tiano Manti Sep 25 '22 at 22:37
  • I have tested the algorithm with several patterns which I have placed randomly into the file. 200k sample and it would finish in around 1 minute. I cannot believe how fast this algorithm is. I am still second guessing myself. I will make sure everything works then I will accept this as answer. – Tiano Manti Sep 25 '22 at 23:20
  • What if the first two bytes of the patterns do not have a random distribution? For example, extreme case, imagine that all the patterns start with 0,0, and the `file` contains all zeros. In that case the complexity will most likely deteriorate back to quadratic O(n*m). – Theodor Zoulias Sep 26 '22 at 09:56
  • @TheodorZoulias, yes. But obviously this did not happen. The OP says the time went from 2/3 of a day to a day down to around 1 minute now! With 2 bytes as index. With 3 bytes it will be even faster; however, the patterns must be at least 3 bytes long. Added the algorithm for 3 bytes. – Olivier Jacot-Descombes Sep 26 '22 at 12:25
  • @OlivierJacot-Descombes Thanks a lot for the algorithm. It made my life easier. I wonder if there is any guide or post that explains it to someone who has limited knowledge of these things. I want to understand how it works, but the information you posted is a bit too advanced for me. Thanks for your time. – Tiano Manti Sep 26 '22 at 12:40
  • We don't know if the OP wants to run the program just once, for a given file and a given set of patterns. We don't have the broad picture of their domain problem. So I think that it would be a good idea to mention the limitations of this algorithm, and the assumptions that have to be made in order to get the advertised O(n) complexity. Don't get me wrong, I like this algorithm, and I have upvoted the answer. – Theodor Zoulias Sep 26 '22 at 13:11
  • 1
    It is simpler than it looks. I store the patterns in a huge array `patternMatrix`. The array elements are a list of patterns (or `null` if no pattern is stored there). The start of a pattern (2 or 3 bytes) makes the index in this array. Then I loop through the file bytes. At each file position I take the current byte plus the following 1 or 2 bytes (depending on whether a 2 or 3 byte index is used) to calculate the index again. Then instead of comparing all the 10 million patterns, I compare just the ones at this array index in `patternMatrix`. – Olivier Jacot-Descombes Sep 26 '22 at 13:11
  • 2
    @TheodorZoulias, I added a mention of this limitation at the end of my answer. Thank you for pointing out. – Olivier Jacot-Descombes Sep 26 '22 at 13:18
  • @OlivierJacot-Descombes I need to track the progress of patterns tested. Is it possible to do this with the algorithm? – Tiano Manti Sep 26 '22 at 17:50
  • 1
    No problem, just report the progress e.g. when `fileIndex % 10000 == 0`. I made a little improvement. We can start comparing the pattern at the 3rd or 4th byte, since we know that the first 2 or 3 bytes are equal. When doing so, strangely, the 2-byte version became faster than the 3-byte version. Probably because the initialization of the 3-byte version is slower. This made a > 35% improvement in the 2-byte version. – Olivier Jacot-Descombes Sep 27 '22 at 12:03
  • @OlivierJacot-Descombes I really cannot thank you enough for the algorithm and help. – Tiano Manti Sep 28 '22 at 21:52
  • @OlivierJacot-Descombes Does your edit mean the 2+ byte implementation is better than the 3+ byte implementation now? – Sami Sep 29 '22 at 20:50
  • No, it depends on the file size and the number of patterns. I created random file and pattern bytes by using the `System.Random` class. With 10 million patterns with sizes between 4 and 15 and a file size of 1 million I got 3.9 s and 7.2 s for the 2- and 3-byte versions. With the same number of patterns and a file size of 10 million I got 25.6 s and 7.6 s for 2 and 3 bytes. I.e., for large files 3 bytes is better. But you must do these tests with real files and real patterns (I used `System.Diagnostics.Stopwatch`, in Release mode), because those will not contain random patterns and will thus behave differently. – Olivier Jacot-Descombes Sep 30 '22 at 13:21
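
A minimal sketch of the benchmark setup described in the previous comment (the seed and the exact loop structure are assumptions):

var random = new Random(0); // fixed seed only for reproducibility
byte[] file = new byte[10_000_000];
random.NextBytes(file);

var patterns = new byte[10_000_000][];
for (int i = 0; i < patterns.Length; i++)
{
    patterns[i] = new byte[random.Next(4, 16)]; // pattern sizes between 4 and 15
    random.NextBytes(patterns[i]);
}

var stopwatch = System.Diagnostics.Stopwatch.StartNew();
// ... run the 2-byte or 3-byte index search from the answer here ...
stopwatch.Stop();
Console.WriteLine(stopwatch.Elapsed);
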
1

One idea is to group the patterns by their length, put each group in a HashSet<byte[]> for searching with O(1) complexity, and then scan the source byte[] index by index for all groups. Since the number of groups in your case is small (only 9 groups), this optimization should yield significant performance improvements. Here is an implementation:

IEnumerable<byte[]> FindMatches(byte[] source, byte[][] patterns)
{
    Dictionary<int, HashSet<ArraySegment<byte>>> buckets = new();
    ArraySegmentComparer comparer = new();
    foreach (byte[] pattern in patterns)
    {
        HashSet<ArraySegment<byte>> bucket;
        if (!buckets.TryGetValue(pattern.Length, out bucket))
        {
            bucket = new(comparer);
            buckets.Add(pattern.Length, bucket);
        }
        bucket.Add(pattern); // Implicit cast byte[] => ArraySegment<byte> 
    }
    for (int i = 0; i < source.Length; i++)
    {
        foreach (var (length, bucket) in buckets)
        {
            if (i + length > source.Length) continue;
            ArraySegment<byte> slice = new(source, i, length);
            if (bucket.TryGetValue(slice, out var pattern))
            {
                yield return pattern.Array;
                bucket.Remove(slice);
            }
        }
    }
}

Currently (.NET 6) there is no equality comparer for sequences available in the standard libraries, so you'll have to provide a custom one:

class ArraySegmentComparer : IEqualityComparer<ArraySegment<byte>>
{
    public bool Equals(ArraySegment<byte> x, ArraySegment<byte> y)
    {
        return x.AsSpan().SequenceEqual(y);
    }

    public int GetHashCode(ArraySegment<byte> obj)
    {
        HashCode hashcode = new();
        hashcode.AddBytes(obj);
        return hashcode.ToHashCode();
    }
}

This algorithm assumes that there are no duplicates in the patterns. If that's not the case, only one of the duplicates will be emitted.

On my (not very speedy) PC this algorithm takes around 10 seconds to create the buckets dictionary (for 10,000,000 patterns with sizes 12-20), and then an additional 5-6 minutes to scan a source byte[] of size 70,000,000 (it scans around 200,000 bytes per second). The number of patterns does not affect the scanning phase (as long as the number of groups is not increased).

Parallelizing this algorithm is not trivial, because the buckets are mutated during the scan.

Theodor Zoulias
  • Thanks for the reply. Though, I gave the algorithm only 10 patterns, waited 11 minutes, and the method did not finish. – Tiano Manti Sep 25 '22 at 22:11
  • @Tiano Manti weird. You could try adding this line inside the `for` loop: `if (i % 100000 == 0) Console.WriteLine(i);`, so that you have a visual indication of the progress. – Theodor Zoulias Sep 25 '22 at 22:55
  • @TianoManti btw make sure to run the program in Release mode. Running in Debug mode is slower. – Theodor Zoulias Sep 26 '22 at 02:54