6

I've got a very long array of bytes, for example:

Byte[] bytes = {90, 80, 63, 65, 70 ...};

It's nearly 20-30 Mb (theoretically). Is there a fast way to check if this array contains another array, for example:

Byte[] small = {63, 80, 75, 77};

First, I need find bytes in that order which they was defined in small array. Second, I need find array in another array not any byte of small array. Thanks all to advance.

DIF
  • 2,470
  • 6
  • 35
  • 49
kate
  • 291
  • 3
  • 15
  • 1
    Can you be more specific? Do you mean the exact sequence or those numbers, in any order in any place? – II ARROWS Jan 26 '16 at 07:27
  • 1
    Possible duplicate of [Check whether an array is a subset of another](http://stackoverflow.com/questions/332973/check-whether-an-array-is-a-subset-of-another) – Farhad Jabiyev Jan 26 '16 at 07:27
  • Are you looking for the *sequence*, i.e. that those four bytes must occur in that order? If so, this is generally known as `substring search` (even when it doesn't involve strings in whatever language you're working in), and you should be able to look up quite a few algorithms for it. – Damien_The_Unbeliever Jan 26 '16 at 07:27
  • Do you want them to show up in the same order? If order doesn't matter, then you can probably sort your initial array and then binary search for candidates. If the order matters, then there are various algorithms that can help you out. Off the top of my head Aho Corasick might be the fastest in your case. –  Jan 26 '16 at 07:29
  • You can try `IsSubsetOf` method as explained in [this answer](http://stackoverflow.com/questions/332973/check-whether-an-array-is-a-subset-of-another#333003). – Farhad Jabiyev Jan 26 '16 at 07:29
  • Is the size of the small array fixed? – xyz Jan 26 '16 at 08:09

6 Answers6

5

For performance you'll want something like the Boyer-Moore string search algorithm. While it's designed for strings, it should work just as well on byte arrays, and is much more performant than a brute-force solution.

The Wikipedia article provides several implementations, including one in Java and one in C, so creating a C# implementation should be fairly painless.


As it turns out, translating the Wikipedia article's Java implementation of Boyer-Moore to C# (and char to byte) was painless indeed. Here it is:

public static class BoyerMoore
{
    public static int IndexOf(byte[] haystack, byte[] needle)
    {
        if (needle.Length == 0)
        {
            return 0;
        }

        int[] charTable = MakeCharTable(needle);
        int[] offsetTable = MakeOffsetTable(needle);
        for (int i = needle.Length - 1; i < haystack.Length;)
        {
            int j;
            for (j = needle.Length - 1; needle[j] == haystack[i]; --i, --j)
            {
                if (j == 0)
                {
                    return i;
                }
            }

            i += Math.Max(offsetTable[needle.Length - 1 - j], charTable[haystack[i]]);
        }

        return -1;
    }

    private static int[] MakeCharTable(byte[] needle)
    {
        const int ALPHABET_SIZE = 256;
        int[] table = new int[ALPHABET_SIZE];
        for (int i = 0; i < table.Length; ++i)
        {
            table[i] = needle.Length;
        }

        for (int i = 0; i < needle.Length - 1; ++i)
        {
            table[needle[i]] = needle.Length - 1 - i;
        }

        return table;
    }

    private static int[] MakeOffsetTable(byte[] needle)
    {
        int[] table = new int[needle.Length];
        int lastPrefixPosition = needle.Length;
        for (int i = needle.Length - 1; i >= 0; --i)
        {
            if (IsPrefix(needle, i + 1))
            {
                lastPrefixPosition = i + 1;
            }

            table[needle.Length - 1 - i] = lastPrefixPosition - i + needle.Length - 1;
        }

        for (int i = 0; i < needle.Length - 1; ++i)
        {
            int slen = SuffixLength(needle, i);
            table[slen] = needle.Length - 1 - i + slen;
        }

        return table;
    }

    private static bool IsPrefix(byte[] needle, int p)
    {
        for (int i = p, j = 0; i < needle.Length; ++i, ++j)
        {
            if (needle[i] != needle[j])
            {
                return false;
            }
        }

        return true;
    }

    private static int SuffixLength(byte[] needle, int p)
    {
        int len = 0;
        for (int i = p, j = needle.Length - 1; i >= 0 && needle[i] == needle[j]; --i, --j)
        {
            len += 1;
        }

        return len;
    }
}

The algorithm spends a linear bit of time at the start, building its tables; from then it's blazingly fast.

Petter Hesselberg
  • 5,062
  • 2
  • 24
  • 42
3
static int search(byte[] haystack, byte[] needle)
{
    for (int i = 0; i <= haystack.Length - needle.Length; i++)
    {
        if (match(haystack, needle, i))
        {
            return i;
        }
    }
    return -1;
}

static bool match(byte[] haystack, byte[] needle, int start)
{
    if (needle.Length + start > haystack.Length)
    {
        return false;
    }
    else
    {
        for (int i = 0; i < needle.Length; i++)
        {
            if (needle[i] != haystack[i + start])
            {
                return false;
            }
        }
        return true;
    }
}
Gehan Fernando
  • 1,221
  • 13
  • 24
3

Use this:

bool isSubset = !t2.Except(t1).Any();

It's from @Farhad Jabiyev's Link: Check whether an array is a subset of another

Community
  • 1
  • 1
M. Schena
  • 2,039
  • 1
  • 21
  • 29
0

If you've got millions of elements in bytes, I'd suggest:

  1. sort bytes
  2. foreach in small if cannot find element return false

Thus

bytes.Sort(); // only need to do this once.
bool smallContained = ContainsAll(bytes, small);

and

static bool ContainsAll(int[] src, int [] subset)
{ 
    foreach(var i in subset)
       if (src.BinarySearch(i) < 0)
               return false;
    return true;
}
Jens Meinecke
  • 2,904
  • 17
  • 20
0

If I understand correctly you want to say if small is a subsequence of bytes. You can find it with a loop over bytes. It should run very fast thanks to processor caching.

  for (int i = 0, index = 0; i < bytes.Length; ++i) 
    if (bytes[i] == small[index]) {
      if (++index >= small.Length) {
        return true; 
      }
    }
  return false;
Sorin
  • 11,863
  • 22
  • 26
  • @PetterHesselberg Do you mean it doesn't correctly detect subsequences or that it doesn't address the confusing question at the top? – Sorin Jan 27 '16 at 08:26
  • It doesn't even compile as-is; the code contains one opening brace and two closing braces. But that's a quibble, easily fixed. The problem is that it finds false positives -- if the bytes of `small` exist in `bytes`, but with other values in-between, your algorithm still returns `true`. I do believe the OP wants a consecutive byte sequence. – Petter Hesselberg Jan 27 '16 at 09:04
  • @PetterHesselberg Thanks for finding the typo. Those are not false positives, that's the definition of a subsequence. They OP said "First, I need find bytes in that order which they was defined in small array" and I understand that as subsequence. The code above delivers that. I can't understand the second one. Hopefully the OP comes back and clarifies. – Sorin Jan 27 '16 at 09:43
0

You can use this function, from this Reddit post:

public static bool CheckSequence<T>(IEnumerable<T> containingArray, IEnumerable<T> containedArray)
{
    bool result = false;
    for (int i = 0; i <= containingArray.Count(); i++)
    {
        if (containedArray.SequenceEqual(containingArray.Skip(i).Take(containedArray.Count())))
            result = true;
    }
    return result;
}

Like:

var result = CheckSequence(bytes, small);
Manuel Fabbri
  • 542
  • 1
  • 7
  • 16