2

I want to search an int in a large (50mb+) byte array. What algorithm should I use? Maybe some unsafe method?

EDIT: It's not a int array, it's a byte array. The data is not sorted in any way.

blez
  • 4,939
  • 5
  • 50
  • 82
  • 3
    So, is the array bytes or ints? Doesn't make a lot of sense to "search an int in a... byte array". Is the data sorted? – Joe Aug 26 '11 at 19:03
  • 1
    Voting to close because this question is very incomplete, and no previous attempt was shown. Bytes are 8 bits, and "DWORDS" are generally 32 bits. What exactly do you mean? How is your data aligned? And what do you mean by "fastest"? On which processor? How often will you search the same list? How much memory can we consume? – Merlyn Morgan-Graham Aug 26 '11 at 19:04
  • 1
    @Merlyn Morgan-Graham: I think he wants to search for 4 consecutive bytes in that array, represented from the int value – BlackBear Aug 26 '11 at 19:08
  • 1) I'm searching for a Int32 2) my data isn't aligned 3) I mean fastest each call (not progressive) 4) doesn't matter, I'm doing it in .NET 2, so no parallel 5) only 1 time on every list, but I search every 2 secs 6) not sure. – blez Aug 26 '11 at 19:10
  • @blez: You can do multi-threaded operations without .Net 4, but you don't specify if you want "any" or "first", so that option may or may not be viable. Also, I can imagine a lot of scenarios where pushing this data into a different data structure could boost your performance exponentially. So "fastest" requires us to understand your whole project architecture. If you're talking 50+ mb, there is almost assuredly a better answer than chugging through an array on every search... – Merlyn Morgan-Graham Aug 26 '11 at 19:17
  • @Merlyn Morgan-Graham: I want the indexes of all occasions of the int in the byte array. – blez Aug 26 '11 at 19:21
  • Is it problem to convert this byte array to related int array then sort int array and find related numbers in O(log n) (after sorting)? – Saeed Amiri Aug 27 '11 at 05:53

9 Answers9

3
public IList<int> FindIntInBytes(byte[] bytes, int target)
{
    IList<int> found = new List<int>();

    unsafe
    {
        fixed (byte* pBytes = bytes)
        {
            byte* pCurrent = pBytes;
            for (int i = 0; i <= bytes.Length - 4; i++, pCurrent++)
            {
                if (target == *(int*)pCurrent)
                {
                    found.Add(i);
                }
            }
        }
    }

    return found;
}

Won't work on big-endian architectures but they are not used for most .Net applications.

Split into sections and run in multiple threads then merge results for faster performance depending on the size of the array and CPU availability.

Stephen Martin
  • 9,495
  • 3
  • 26
  • 36
2

Here's my implementation. Works in O(n);

int findInArray(byte[] array, int what)
{
    byte[] toMatch = /* convert your dword to a 4 elements byte array */;

    int matched = 0;
    for(int i = 0; i < array.length; i++) {
        if(array[i] == toMatch[matched]) {
            matched += 1;
            if(matched == 4) {
                return i - 4;
            }
        }
        else {
            i -= matched;
            matched = 0;
        }
    }

    return -1;
}
ChaosPandion
  • 77,506
  • 18
  • 119
  • 157
BlackBear
  • 22,411
  • 10
  • 48
  • 86
  • This won't work in some cases. e.g. `byteArray = 0x0505909090` and `int = 0x05909090` because the first 0x05 will match, but the second 0x05 will reset the match. And then you'll be at 0x90. – Jacob Eggers Aug 26 '11 at 19:14
  • @Jacob Eggers: hmm... Clever ;) Edited – BlackBear Aug 26 '11 at 19:16
  • Good, but slow, I guess if it will be faster if it's searched by ints, not bytes (Like C's int*) – blez Aug 26 '11 at 19:20
  • @blez: Yes, using pointers and unsafe code is something you'd surely want to do here. @BlackBear: The OP clarified they want every index. I'd suggest you return an `IEnumerable` and simply `yield return` your results. – Merlyn Morgan-Graham Aug 26 '11 at 19:23
  • @blez: yep, but I don't know if it's possible to convert a byte array to an int array. It would be much simpler: if(array.Contains(.)) ... – BlackBear Aug 26 '11 at 19:24
  • @BlackBear - Iterating through the array with `GetEnumerator()` and `MoveNext()` is actually faster. And the improvement should be very noticeable considering how little the algorithm is doing for each iteration. – Steve Wortham Aug 26 '11 at 23:37
  • @Steve Wortham: it won't work because you could have to get back on partial matches (the else part) – BlackBear Aug 26 '11 at 23:53
1

What you are essentially doing is looking for a substring in string. So you'll want to look at string search algorithms.

BlackBear suggestion is a naive string search. You could also use, for example, the Knuth–Morris–Pratt algorithm

Jacob Eggers
  • 9,062
  • 2
  • 25
  • 43
1

This sounds like a job where you'd possibly want to extract the integers from the array and set up a simple hash table or binary tree, if you're doing a lot of searches on the same data. Databases have indexes for the same reason. You can get N/2 performance or better, depending on your index.

See this article: How does database indexing work?

And this article: http://en.wikipedia.org/wiki/Binary_tree

If you want to go this route, open a new question about which one would be more appropriate for the task you're working on.

Community
  • 1
  • 1
djdanlib
  • 21,449
  • 1
  • 20
  • 29
1

Algorithmically, there's no shortcut to searching the whole thing. Implementation-wise, if performance is going to be a big deal, the best you can do is write your code to avoid memory reads, branches, and function calls wherever possible. This will make it easier for the compiler to generate efficient code (although clever compilers may anyway and it's difficult to guarantee anything about eventual machine code when you're compiling to a VM which then recompiles it into machine code to run). My implementation would look like this:

System.Collections.Generic.IEnumerable<int> FindIntInByteArray(int match, byte[] array) {
    if (array.Length < 4) yield break;
    byte b0 = 0;
    byte b1 = array[0];
    byte b2 = array[1];
    byte b3 = array[2];
    int len = array.Length;
    for (int i=3;i<len;i++) {
        b0 = b1;
        b1 = b2;
        b2 = b3;
        b3 = array[i];
        /* The following line should be changed depending on endian-ness or which
           bytes are to be considered most significant. */
        int comp = (b0 << 24) | (b1 << 16) | (b2 << 8) | b3;
        if (comp == match) yield return i-3;
    }
}
Keith Irwin
  • 5,628
  • 22
  • 31
  • Good code, it's as fast as the unsafe code of Stephen Martin. – blez Aug 27 '11 at 01:48
  • No, excuse me, it's a little bit _faster_. Which is strange. – blez Aug 27 '11 at 01:57
  • @blez It's not that strange. You have to keep in mind that there's really no such thing as non-aligned memory access. So when you dereference a pointer which is non-aligned, you're really reading two ints and then throwing away parts of them and the reassembling them. You don't usually write that part in the code yourself, but it's happening. The biggest difference between mine and his is that I write some of that part in the code and he doesn't. He might be having one more memory read per loop as a result since my code is likely to carry some of this over in registers. (cont.) – Keith Irwin Aug 27 '11 at 03:58
  • And he may also be losing a little performance by not storing bytes.Length-4 in a variable since that makes the compiler less likely to be certain that it doesn't change and more likely to recompute it every loop. It might be possible to speed mine up some by unrolling the loop some into aligned groups of 4 bytes and reading them in as ints in order to be sure that memory reads are being minimized (this would require unsafe in much the same way he uses it). – Keith Irwin Aug 27 '11 at 04:03
  • Aha, but please fix your code, it has errors. And I'll try it. – blez Aug 27 '11 at 09:40
  • @blez Sorry about that. I didn't have a compiler handy when I was answering. It went fine for the first one, but the second one got more complex. Mostly, I just needed to delete one extraneous line which hadn't gotten erased when I refactored something and to change to explicit byte conversions (which it turns out does the masking for me). Hopefully the explicit byte conversions won't slow things down, but we'll have to see. I also didn't know that you can't have unsafe blocks inside of an enumeration, so I've switched to IList. Anyway, it should all be working now. – Keith Irwin Aug 27 '11 at 17:47
  • Yes, thank you, but the first solution is at least 1/4 times faster. – blez Aug 27 '11 at 21:30
  • Well darn. I think that the int to byte conversion might be slowing it down, so I'm switching my bytes to ints to avoid that and also restructuring a little for improved odds of instruction-level parallelism. Give this version a try. If this isn't faster than my original, then I'll just delete it and give up :) – Keith Irwin Aug 28 '11 at 05:14
0

Even in .Net 2.0 you can create new threads that will allow you to parallelize the searching of it. You don't mention if you are looking for all instances of the int. I can think of several ways of improving the speed than a straight search, including pre-processing the array into dictionaries for lookups, etc. if you always use the same array to find the data, etc.

Jared Peless
  • 1,120
  • 9
  • 11
0

Here's one method. It only requires looking at every 4th byte, so should be slightly faster. (If you're looking for 0x11223344, and you find a 0x55, you know that the target integer doesn't contain this byte at all.) Should be O(n/4).

I didn't run this, there may be syntax or off-by-one errors.

bool Compare4(byte[] searchIn, int offset, byte[] searchFor)
{
    return searchIn[offset]   == searchFor[0] &&
           searchIn[offset+1] == searchFor[1] &&
           searchIn[offset+2] == searchFor[2] &&
           searchIn[offset+3] == searchFor[3];
}

/// Returns index if found, -1 if not found.
int FindIntInArray(int target, byte[] array)
{
    byte[] targetArray = new byte[4];
    targetArray[0] = target & 0xFF;
    targetArray[1] = (target>>8) & 0xFF;
    targetArray[2] = (target>>16) & 0xFF;
    targetArray[3] = (target>>24) & 0xFF;

    bool[] bytesUsed = new bool[256];
    foreach(byte b in targetArray) bytesUsed[b] = true;

    for(int i = 3; i < array.Length - 3; i += 4)
    {
        if(bytesUsed[array[i]])
        {
            if(Compare4(array, i-3, targetArray)) return i-3;
            if(Compare4(array, i-2, targetArray)) return i-2;
            if(Compare4(array, i-1, targetArray)) return i-1;
            if(Compare4(array, i, targetArray)) return i;
        }
    }

    return -1;
}
David Yaw
  • 27,383
  • 4
  • 60
  • 93
  • Hmm.. suppose I'd like to look for 0x11223344 into something like 0x11, 0x11, 0x22, 0x33, 0x44. That code wouldn't catch this – BlackBear Aug 26 '11 at 23:30
  • Interesting solution, but loading a chunk from memory is the expensive operation, not the comparison. So loading in 4 byte chunks will be faster than loading a byte. – blez Aug 27 '11 at 00:52
0

If I understand your question properly, you want to search a byte array for what amounts to a particular pattern of 4 bytes. The following should do the trick, finding the int value at any position within the array—no assumption are made about alignment.

Edited to note that

  • it runs in O(n) time, and
  • any byte order issues are your problem.

Here's the code:

private static int FindIntValueInByteArray( int value , byte[] octets )
{
  int matchPosition = -1 ; // assume no match

  for ( int i = 0 ; i < octets.Length-3 ; ++i )
  {
    int t = BitConverter.ToInt32( octets , i ) ;
    if ( t == value )
    {
      matchPosition = i ;
      break ;
    }
  }

  return matchPosition ;
}
Nicholas Carey
  • 71,308
  • 16
  • 93
  • 135
0
public static class ByteListExtensions
{
    public static IEnumerable<int> AllIndexesOf(this IList<byte> source, int match, bool isLittleEndian = true)
    {
        if (source.Count < 4)
        {
            return Enumerable.Empty<int>();
        }
        var b0 = (byte)(match & (isLittleEndian ? 0xff000000 : 0x000000ff));
        var b1 = (byte)(match & (isLittleEndian ? 0x00ff0000 : 0x0000ff00));
        var b2 = (byte)(match & (isLittleEndian ? 0x0000ff00 : 0x00ff0000));
        var b3 = (byte)(match & (isLittleEndian ? 0x000000ff : 0xff000000));
        var indexes = Enumerable.Range(0, source.Count / 4)
            .AsParallel()
            .Select(x => x * 4)
            .Where(x => source[x] == b0)
            .Where(x => source[x + 1] == b1)
            .Where(x => source[x + 2] == b2)
            .Where(x => source[x + 3] == b3);
        return indexes;
    }
}

sample usage:

var callingAssembly = Assembly.GetCallingAssembly();
var bytes = File.ReadAllBytes(callingAssembly.Location);
const int intToFind = 42;
var indexes = bytes.AllIndexesOf(intToFind);
foreach (var index in indexes)
{
    Console.WriteLine(index);
}
Handcraftsman
  • 6,863
  • 2
  • 40
  • 33