0

I have an int array[10000] and I want to iterate from a certain position to find the next non-zero index. Currently I use a basic while loop:

while(array[i] == 0){
    pos++;
}

etc

I know with intrinsics I could test 4 integers for zero at a time, but is there a way to return something indicating the vector index of the "first" non-zero?

user997112
  • 29,025
  • 43
  • 182
  • 361
  • 1
    Have you tried the simple solution: [`std::find`](http://en.cppreference.com/w/cpp/algorithm/find)? Unless you want to search millions of records it's no use complicating things. – Some programmer dude Apr 24 '14 at 16:47
  • @JoachimPileborg I am looking for low latency so its worth it. I wouldnt bother asking if I didnt need it. Appreciate the advice- but I'm asking about intrinsics because I need the speed. – user997112 Apr 24 '14 at 16:52
  • I think you're looking for the next non-zero *item*, not *index*...? – CiaPan Apr 24 '14 at 16:53
  • @CiaPan no the index the item was found at. – user997112 Apr 24 '14 at 16:54
  • And you *have* benchmarked your current (and other) solutions? Unless you're coding for a very old or slow computer looping over just 10000 items is quite fast on a modern multi-GHz machine. – Some programmer dude Apr 24 '14 at 16:56
  • 1
    Ok, you need an index, anyway it's an *item* which has to be non-zero. :) IMHO if your array is not very sparse, you shouldn't worry about sequential testing. But if it is, maybe you should consider applying some special data structure for effective handling sparse tables? – CiaPan Apr 24 '14 at 16:58
  • your array is pretty small, but if it were bigger, you could save time on a multi-core machine by spawning a few threads to search it. – Red Alert Apr 24 '14 at 17:14
  • @RedAlert I did wonder at what point using multiple threads would be advantageous. – user997112 Apr 24 '14 at 17:22
  • at whatever point the overhead of creating new threads is less than what it takes to find the next non-zero in your array. When exactly that is, depends on the size & distribution of your array. – Red Alert Apr 24 '14 at 17:24
  • 1
    Optimizing for wide instruction/comparison (like Haswell AVX2) makes sense only if you are confident that the data is in the cache, the virtual pages are in the TLB, and the array is sparse enough. If you end up getting a TLB miss or cache miss (hundreds of cycles to resolve) then which instruction you use to load and compare doesn't really matter. See http://stackoverflow.com/questions/15172102/using-c-intel-assembly-what-is-the-fastest-way-to-test-if-a-128-byte-memory-blo – amdn Apr 24 '14 at 18:59

1 Answers1

3

It's fairly simple to do this, but throughput improvement may not be great, since you will probably be limited by memory bandwidth (unless your array is already cached):

int index = -1;
for (i = 0; i < n; i += 4)
{
    __m128i v = _mm_load_si128(&A[i]);
    __m128i vcmp = _mm_cmpeq_epi32(v, _mm_setzero_si128());
    int mask = _mm_movemask_epi8(vcmp);
    if (mask != 0xffff)
    {
        break;
    }
}
if (i < n)
{
    for (j = i; j < i + 4; ++j)
    {
        if (A[j] != 0)
        {
             index = j;
             break;
        }
    }
}

This assumes that the array A is 16 byte aligned, its size, n, is a multiple of 4, and that the ints are 32 bits.

Loop unrolling by a factor of 2 may help, particularly if your input data is large and/or sparse, e.g.

int index = -1;
for (i = 0; i < n; i += 8)
{
    __m128i v0 = _mm_load_si128(&A[i]);
    __m128i v1 = _mm_load_si128(&A[i + 4]);
    __m128i vcmp0 = _mm_cmpeq_epi32(v0, _mm_setzero_si128());
    __m128i vcmp1 = _mm_cmpeq_epi32(v1, _mm_setzero_si128());
    int mask0 = _mm_movemask_epi8(vcmp0);
    int mask1 = _mm_movemask_epi8(vcmp1);
    if ((mask0 | mask1) != 0xffff)
    {
        break;
    }
}
if (i < n)
{
    for (j = i; j < i + 8; ++j)
    {
        if (A[j] != 0)
        {
             index = j;
             break;
        }
    }
}

If you have AVX2 (Haswell and later) then you can process 8 ints at a time rather than 4.

Paul R
  • 208,748
  • 37
  • 389
  • 560
  • Could use a while loop and mask as the continuation condition? – user997112 Apr 24 '14 at 17:54
  • I don't think it would make much difference, but go ahead and try it. – Paul R Apr 24 '14 at 17:56
  • @PaulR, good answer, but I don't see how unrolling will help. There is no dependency chain (such as an accumulator). – Z boson Apr 24 '14 at 18:57
  • It depends on the CPU and the compiler (and compiler switches) but in my experience small loops often benefit from 2x unrolling, if it's done intelligently. Note that this also allows you to amortize the cost of the mask test and branch. I'll post an unrolled example if that's not clear and you can try both if you like. – Paul R Apr 25 '14 at 00:13
  • @PaulR, I think the unrolled version is different than what the out of order unit would do. So it could help. Out of order could start the next test but if that test failed before it finished the previous test it would have to wait for the previous one to finish whereas with the unroll it would bail with either one. I guess the conditional test is a kind of dependency chain. Just not the way I'm used to thinking about it. – Z boson Apr 25 '14 at 07:12
  • Well even without the merged test, unrolling the loop can still help to hide some load latency. – Paul R Apr 25 '14 at 13:49
  • Which header file defines the "_mm_load_epi32" method?. I cannot find it. – warunapww May 29 '15 at 20:01
  • 1
    @warunapww: sorry - that's just a convenience macro that I use - it corresponds to `_mm_load_si128` - answer updated. – Paul R May 29 '15 at 22:20
  • Your solution is fast only when there are very few nonzero elements. I guess even if there are only 10% of them, you'll already have performance problems. On the other hand, stream compaction can be done on SSE without branches by `_mm_shuffle_epi8` combined with LUT, see [here](http://stackoverflow.com/q/6270641/556899). If there a many nonzero elements, it would help. – stgatilov Sep 05 '15 at 18:19
  • @stgatilov: yes, I was assuming that the array is fairly sparse - I should probably have made that clearer in the answer. – Paul R Sep 05 '15 at 18:24