
I'm looking to optimize this linear search:

static int
linear (const int *arr, int n, int key)
{
        int i = 0;
        while (i < n) {
                if (arr [i] >= key)
                        break;
                ++i;
        }
        return i;
}

The array is sorted and the function is supposed to return the index of the first element that is greater than or equal to the key. The array is not large (below 200 elements) and will be prepared once for a large number of searches. Array elements after the n-th can, if necessary, be initialized to something appropriate, if that speeds up the search.

No, binary search is not allowed, only linear search.

Edit: All my knowledge about this topic is now summarized in this blog post.

Mark Probst
  • 5
    The only thing you can do is take advantage of any SIMD instructions available on your platform. (Test four at a time, for example.) Though why you wouldn't binary search, I don't know. – GManNickG Apr 30 '10 at 01:53
  • 2
    You don't have to test every element; you can test every kth element if you are then allowed to go back. Also, if you know the range of elements you can set up an array / hash table which just gives you the answer. But, you might not consider these "linear search". – Nathan S. Apr 30 '10 at 01:53
  • 31
    Why is binary search (arbitrarily?) not allowed? Is this a real problem or some kind of homework? Because if you're going to go through the trouble of sorting the data, a binary search is going to be your best performer. – Joe Apr 30 '10 at 01:54
  • (Random thought: one two, skip a few -- but that might fall into 'not linear', and if it's already sorted, given any non-trivial n without the match expected near the front and associated locality issues ...) –  Apr 30 '10 at 01:55
  • @pst, That's basically a binary search, where "a few" is "remaining array size / 2". – strager Apr 30 '10 at 01:58
  • @Joe: Binary search performs best only if you know that the element you are looking for is located in a completely unpredictable position. If you know that the target element is likely to be close to the beginning of the array, linear search will outperform binary search. A classic example is the well-known algorithm for merging two sorted arrays. If the arrays have approximately equal length, the merging is done with *linear* search, since binary would be much slower. – AnT stands with Russia Apr 30 '10 at 01:58
  • 1
    Remember binary searches require your data set to be sorted. – BobbyShaftoe Apr 30 '10 at 01:59
  • 1
    actually, I have read somewhere that for small array, linear search can be faster: http://lwn.net/Articles/255364/ (discussion is in comments) – Anycorn Apr 30 '10 at 02:00
  • Would it be considered cheating if you scanned every tenth element first and, after finding the first not less than the key, went back and scanned the last ten elements one by one? How about a square root of n instead of 10? – Maciej Hehl Apr 30 '10 at 02:02
  • @AndreyT - That's great for specific cases, but there's nothing to indicate there's anything special about this dataset. In the generic case, a sorted but otherwise pedestrian set of data is going to work best in most use cases with a binary search. – Joe Apr 30 '10 at 02:04
  • 3
    Yes, not scanning every element would be cheating. @GMan: There's a LOT you can do before having to resort to SIMD. @Joe: This is "homework" I've given myself, which I've also already done. I'm just curious what people come up with that I haven't thought of. – Mark Probst Apr 30 '10 at 02:06
  • @GMan: See the top-rated answer, for example. Much faster than plain linear search (I know, I've benchmarked). – Mark Probst Apr 30 '10 at 02:15
  • 1
    Unrolling by four speeds up by almost 50% at N=100 on a Core i7. Unrolling by four with a sentinel speeds up by more than 50%. – Mark Probst Apr 30 '10 at 02:22
  • 1
    Still no solutions using multiple threads? – too much php Apr 30 '10 at 02:55
  • @Mark Probst: Yeah unrolling can speed things up, guess I need to re-read my Code Complete book :-) Here's the unrolling topic from that book http://www.stevemcconnell.com/cctune.htm – Michael Buen Apr 30 '10 at 03:01
  • 2
    @Joe: Yes, but look at the problem: they have to implement *linear* search in a *sorted* array. This is already sufficiently non-generic. This already implies something specific. Why would anyone insist on a *linear* search in a *sorted* array? Maybe they do it because the structure of the queries favors linear search specifically? For example, if you have an ordered array of N elements and have to make close to N *ordered* search queries, incremental linear search will outperform binary search by an order of magnitude at least. – AnT stands with Russia Apr 30 '10 at 04:17
  • @Mark Probst: you compile with optimizations enabled, right? – Bastien Léonard Apr 30 '10 at 08:21
  • 1
    binary search with some a priori knowledge of the distribution of the data will still be faster than linear search in most cases; the simplest approach is to use the distances to linearly interpolate the splitting point for the next iteration instead of going to the middle of the range (this works best if the data is uniformly distributed; in other cases the interpolation formula should be close to that distribution) – fortran Apr 30 '10 at 10:32
  • 1
    If someone insisted on using a linear search on sorted data, I would just return a random result, because someone that foolish wouldn't know the difference. – mikerobi Apr 30 '10 at 20:36
  • @fortran: Well, if you have to make M *sorted* queries into an array of N *sorted* elements, the asymptotically optimal algorithm proceeds as follows: first we perform a straddled *linear* search with step [N/M], i.e. do a linear search skipping to each [N/M]-th element, and then do a *binary* search in the found segment of length [N/M]. When M is close to N, [N/M] becomes small and binary search gets "disabled". So, no, under the above conditions (i.e. sufficiently dense sorted queries to the same data) binary search will not be faster than linear search. Linear search will be much faster. – AnT stands with Russia Apr 30 '10 at 20:50
  • @fortran: The above blend of straddled-linear search followed by binary search is proven to be an asymptotically optimal algorithm that achieves the theoretical limit of search efficiency. So, no: binary search is only faster when the queries are sparse. With dense queries, linear search wins by a huge margin. And, again, for intermediate cases the optimal algorithm uses a blend of the two. The theoretical result is available from this article: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.22.5750 – AnT stands with Russia Apr 30 '10 at 20:54
  • 1
    @mikerobi: Which would probably make it much more surprising to you to find out that incremental linear search on sorted data makes perfect sense when we have to make multiple sorted queries. When the number of the queries approaches the size of the data, linear search outperforms the binary search by an order of magnitude. Moreover, for that reason virtually everybody uses it (i.e. linear search) when merging sorted data. They just don't realize that. – AnT stands with Russia Apr 30 '10 at 21:20
  • @Suma this is not a `[code-golf]` problem. If it was left tagged like that it would be closed because code-golf problems that aren't CW constantly get closed – Earlz Apr 30 '10 at 22:30
  • 1
    @AndreyT, merging sorted data takes linear time, but it is not a linear search. You are correct about searching for multiple values at one time, but that was not part of the OP problem statement. – mikerobi May 01 '10 at 18:27
  • 1
    @mikerobi: Yes, it is linear search :) The classic merging algorithm for sorted arrays is based on taking the current minimal element from the two arrays and sending it to output. It is not obvious, but this is in fact nothing else than a plain *linear search* of elements from one array in another array. It is simply obfuscated a little, so you won't see it right away, but in reality it is a plain and straightforward linear search. – AnT stands with Russia May 01 '10 at 19:12
  • Moreover, the same logic applies to merging as well: if you are merging two arrays of significantly different lengths, it is better to switch to *binary* search for merging. But if the lengths are about the same, we use the classic algorithm with linear search. – AnT stands with Russia May 01 '10 at 19:13
  • @AndreyT, I disagree, since the lists are sorted, you have pointers to the smallest (or largest) element in each, so you are never searching for the next element of one of the sublists, you are only deciding which sublist to take the element from. – mikerobi May 01 '10 at 20:04
  • @mikerobi: No, you are simply insisting on one specific vision of what happens. Here's the alternative vision for you: to perform the merge we take the first element `a` from sorted array `A` and perform the *linear search* for that element in sorted array `B`. That gives us a [possibly empty] sequence of leading elements in `B`, which are smaller than `a`. We move that entire sequence to output, followed by `a`. Then we repeat: take the next element `a` from `A`... and so on. That's it. – AnT stands with Russia May 01 '10 at 20:23
  • At first sight it might sound like a different algorithm, but if you think about it a bit, you'll see that this is *exactly* the same classic merging algorithm, just described in different terms :) Again, the classic merging algorithm is nothing more than a lightly obfuscated *linear search*. And again, if the arrays have different lengths, the proper way to do the merging is to use *binary* search: take `a` from `A`, do a *binary* search in `B`, move the beginning sequence from `B` to output, move `a` to output, repeat. – AnT stands with Russia May 01 '10 at 20:28
  • 1
    And again, the universal asymptotically *optimal* strategy is a mix of linear *and* binary searches, as described in my answer below. – AnT stands with Russia May 01 '10 at 20:28
  • 4
    I'm voting to close this question as off-topic because it better suits on [codereview.SE]. – double-beep Jul 21 '19 at 10:02
  • @double-beep: this isn't asking for a code-review of the simple scalar linear search, it's using that to describe/show the algorithm to be vectorized. And besides, it's too old to migrate and has existing answers, so if that first argument didn't convince you, then I'd still suggest making an exception to the rule for this historical question. – Peter Cordes Aug 08 '19 at 08:40

19 Answers

20

So far you have received multiple pieces of advice, most of which state that linear search makes no sense on sorted data, since binary search will work much more efficiently instead. This often happens to be one of those popular "sounds right" assertions made by people who don't care to give the problem much thought. In reality, if you consider the bigger picture, given the right circumstances, linear search can be much more efficient than binary search.

Note that if we consider a single search query on a sorted array, binary search is a significantly more efficient method than linear search. There's no argument about that. Also, when you perform multiple completely random queries against the same data, binary search still wins over linear search.

However, the picture begins to change if we consider sequential search queries, and these queries are not exactly random. Imagine that the queries arrive in sorted order, i.e. each query is for a higher value than the previous one. In other words, the queries themselves are sorted. They don't have to be globally and strictly sorted: from time to time the query sequence might get "reset" (a low value is queried), but on average the consecutive queries should arrive in increasing order, i.e. the queries arrive in series, each series sorted in ascending order. In this case, if the average length of a series is comparable to the length of your array, linear search will outperform binary search by a huge margin. However, to take advantage of this situation, you have to implement your search in an incremental manner: if the next query is greater than the previous one, you don't need to start the search from the beginning of the array; instead, you can continue from the point where the previous search stopped. The most simplistic implementation (just to illustrate the idea) might look as follows:

static int linear(const int *arr, int n, int key)
{
  static int previous_key = INT_MIN;  /* requires <limits.h> */
  static int previous_i = 0;

  int i = key >= previous_key ? previous_i : 0;

  while (i < n) {
    if (arr[i] >= key)
      break;
    ++i;
  }

  previous_key = key;
  previous_i = i;

  return i;
}

(Disclaimer: the above implementation is terribly ugly for the obvious reason that the array arrives from outside as a parameter, while the previous search state is stored internally. Of course, this is the wrong way to do it in practice. But again, it is intended only to illustrate the idea, and no more.)

Note that the complexity of processing each series of ordered queries using the above approach is always O(N), regardless of the length M of the series. Using binary search, the complexity would be O(M * log N). So, for obvious reasons, when M is close to N, i.e. when queries arrive in sufficiently long ordered series, the above linear search will significantly outperform binary search, while for small M binary search will win.

Also, even if the ordered series of queries are not very long, the above modification might still give you a noticeable improvement in search performance, considering that you have to use linear search.

P.S. As an additional piece of information about the structure of the problem:

When you need to perform the search in an ordered array of length N and you know in advance that the queries will arrive in ordered series of [approximate, average] length M, the optimal algorithm will look as follows

  1. Calculate the stride value S = [N/M]. It might also make sense to "snap" the value of S to the [nearest] power of 2. Think of your sorted array as a sequence of blocks of length S - so called S-blocks.
  2. After receiving a query, perform incremental linear search for the S-block that potentially contains the queried value, i.e. it is an ordinary linear search with stride S (of course, remember to start from the block where the previous search left off).
  3. After finding the S-block, perform the binary search within the S-block for the queried value.

The above is the optimal incremental search algorithm, in the sense that it achieves the theoretical limit on the asymptotic efficiency of repetitive search. Note that if the value of M is much smaller than N, the algorithm "automatically" shifts itself towards binary search, while when M gets close to N the algorithm "automatically" favors linear search. The latter makes sense because in such an environment linear search is significantly more efficient than binary search.
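To make the steps above concrete, here is a minimal sketch of the blended strategy. The struct, the function name, and the explicit state parameter are my own illustrative choices, and the stride S is assumed to be precomputed as [N/M]:

```c
#include <limits.h>

/* Illustrative sketch of the blend described above: an incremental linear
   scan with stride S locates the candidate S-block, then a binary search
   inside that block finds the first element >= key. The search state is
   passed in explicitly to avoid the static-variable problem of the
   earlier example. */
typedef struct {
    int prev_key;   /* key of the previous query */
    int prev_block; /* start index of the block where the last search ended */
} search_state;

static int blended_search(const int *arr, int n, int key, int S,
                          search_state *st)
{
    /* restart from the beginning when the ordered query series resets */
    int b = (key >= st->prev_key) ? st->prev_block : 0;

    /* step 2: incremental linear search with stride S */
    while (b + S < n && arr[b + S - 1] < key)
        b += S;

    /* step 3: binary search within the S-block for the first element >= key */
    int lo = b, hi = (b + S < n) ? b + S : n;
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (arr[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }

    st->prev_key = key;
    st->prev_block = b;
    return lo;   /* == n when every element is smaller than key */
}
```

As the text says, when M approaches N the stride S shrinks towards 1 and the binary step degenerates, leaving a plain incremental linear search; when M is small, S is large and the binary step dominates.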

This all is just to illustrate the fact that blanket statements like "linear search on a sorted array is always useless" indicate nothing but a lack of knowledge on the part of those who make them.

AnT stands with Russia
  • 2
    I think this is the best answer since the OP said "for a large number of searches". – temp2290 May 03 '10 at 15:36
  • Related: [What is the most efficient way to implement a BST in such a way the find(value) function is optimized for random values in the tree on x86?](//stackoverflow.com/a/56614414) A *binary* search tree isn't always the best data structure for cases where linear isn't good. An N-ary tree where N-1 is some multiple of the SIMD vector width allows efficient searching on modern x86. e.g. 17-ary for 4x 4-element SIMD vectors, with much better spatial locality than a binary search over a sorted array, and fewer steps. SIMD can be very good for linear searching, too. – Peter Cordes Aug 08 '19 at 08:34
13

First of all, any fast solution must use vectorization to compare many elements at once.

However, all the vectorized implementations posted so far suffer from a common problem: they have branches. As a result, they have to introduce blockwise processing of the array (to reduce overhead of branching), which leads to low performance for small arrays. For large arrays linear search is worse than a well-optimized binary search, so there is no point in optimizing it.

However, linear search can be implemented without branches at all. The idea is very simple: the index you want is precisely the number of elements in the array that are less than the key you search for. So you can compare each element of the array to the key value and sum all the flags:

static int linear_stgatilov_scalar (const int *arr, int n, int key) {
    int cnt = 0;
    for (int i = 0; i < n; i++)
        cnt += (arr[i] < key);
    return cnt;
}

A fun thing about this solution is that it would return the same answer even if you shuffled the array =) Although this solution seems rather slow, it can be vectorized elegantly. The implementation provided below requires the array to be 16-byte aligned. Also, the array must be padded with INT_MAX elements, because the loop consumes 16 elements at once.

static int linear_stgatilov_vec (const int *arr, int n, int key) {
    assert((size_t)arr % 16 == 0);
    __m128i vkey = _mm_set1_epi32(key);
    __m128i cnt = _mm_setzero_si128();
    for (int i = 0; i < n; i += 16) {
        __m128i mask0 = _mm_cmplt_epi32(_mm_load_si128((__m128i *)&arr[i+0]), vkey);
        __m128i mask1 = _mm_cmplt_epi32(_mm_load_si128((__m128i *)&arr[i+4]), vkey);
        __m128i mask2 = _mm_cmplt_epi32(_mm_load_si128((__m128i *)&arr[i+8]), vkey);
        __m128i mask3 = _mm_cmplt_epi32(_mm_load_si128((__m128i *)&arr[i+12]), vkey);
        __m128i sum = _mm_add_epi32(_mm_add_epi32(mask0, mask1), _mm_add_epi32(mask2, mask3));
        cnt = _mm_sub_epi32(cnt, sum);
    }
    cnt = _mm_hadd_epi32(cnt, cnt);
    cnt = _mm_hadd_epi32(cnt, cnt);
//  int ans = _mm_extract_epi32(cnt, 0);    //SSE4.1
    int ans = _mm_extract_epi16(cnt, 0);    //correct only for n < 32K
    return ans;
}

The final reduction of a single SSE2 register can be implemented with SSE2 only if necessary, it should not really affect the overall performance.
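For reference, one way to do that SSE2-only reduction (a sketch; the helper name is mine): _mm_hadd_epi32 requires SSSE3 and _mm_extract_epi32 requires SSE4.1, while the variant below uses only SSE2 shuffles and adds, and also lifts the n < 32K restriction of the _mm_extract_epi16 trick.

```c
#include <emmintrin.h> /* SSE2 */

/* Sum the four 32-bit lanes of v using SSE2 only: first add the high
   64-bit half onto the low half, then add the two remaining lanes. */
static int hsum_epi32_sse2(__m128i v)
{
    v = _mm_add_epi32(v, _mm_shuffle_epi32(v, _MM_SHUFFLE(1, 0, 3, 2)));
    v = _mm_add_epi32(v, _mm_shuffle_epi32(v, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(v);
}
```

In the function above, the two _mm_hadd_epi32 lines and the _mm_extract_epi16 line would be replaced by a single call to this helper on `cnt`.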

I have tested it with the Visual C++ 2013 x64 compiler on an Intel Core2 Duo E4700 (quite old, yeah). An array of size 197 is generated with elements provided by rand(). The full code with all the testing is here. Here is the time to perform 32M searches:

[OP]
Time = 3.155 (-896368640) //the original OP's code
[Paul R]
Time = 2.933 (-896368640)
[stgatilov]
Time = 1.139 (-896368640) //the code suggested

The OP's original code processes 10.6 million arrays per second (2.1 billion elements per second). The suggested code processes 29.5 million arrays per second (5.8 billion elements per second). Also, the suggested code works well for smaller arrays: even for arrays of 15 elements, it is still almost three times faster than the OP's original code.

Here is the generated assembly:

$LL56@main:
    movdqa  xmm2, xmm4
    movdqa  xmm0, xmm4
    movdqa  xmm1, xmm4
    lea rcx, QWORD PTR [rcx+64]
    pcmpgtd xmm0, XMMWORD PTR [rcx-80]
    pcmpgtd xmm2, XMMWORD PTR [rcx-96]
    pcmpgtd xmm1, XMMWORD PTR [rcx-48]
    paddd   xmm2, xmm0
    movdqa  xmm0, xmm4
    pcmpgtd xmm0, XMMWORD PTR [rcx-64]
    paddd   xmm1, xmm0
    paddd   xmm2, xmm1
    psubd   xmm3, xmm2
    dec r8
    jne SHORT $LL56@main
$LN54@main:
    phaddd  xmm3, xmm3
    inc rdx
    phaddd  xmm3, xmm3
    pextrw  eax, xmm3, 0

Finally, I'd like to note that a well-optimized binary search can be made faster by switching to the described vectorized linear search as soon as the interval becomes small.
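That hybrid can be sketched as follows. The crossover threshold of 64 is an illustrative guess, not a measured value, and the scalar counting loop stands in for the vectorized version:

```c
/* Sketch: binary search narrows the interval, then the branchless counting
   idea from above finishes the search once the interval is small. Returns
   the number of elements smaller than key, i.e. the index of the first
   element >= key (or n if there is none). */
static int hybrid_search(const int *arr, int n, int key)
{
    int lo = 0, hi = n;
    while (hi - lo > 64) {          /* illustrative crossover point */
        int mid = lo + (hi - lo) / 2;
        if (arr[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    /* every element before lo is < key; count the rest branchlessly */
    int cnt = lo;
    for (int i = lo; i < hi; i++)
        cnt += (arr[i] < key);
    return cnt;
}
```

In a real implementation the tail loop would be the vectorized counting loop above, with the threshold tuned by measurement.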

UPDATE: More information can be found in my blog post on the matter.

stgatilov
12

Since you can put known values after the last valid entry, add a sentinel element at index n that is at least as large as any possible key (e.g. INT_MAX), so the loop can't run past the end of the array and the i < n test becomes unnecessary.

static int
linear (const int *arr, int n, int key)
{
        assert(arr[n] >= key);
        int i = 0;
        while (arr[i] < key) {
                ++i;
        }
        return i;
}

You could also try unrolling the loop, with the same sentinel value:

static int
linear (const int *arr, int n, int key)
{
        assert(arr[n] >= key);
        int i = 0;
        while (true) {
                if (arr [i++] >= key)
                        break;
                if (arr [i++] >= key)
                        break;
                if (arr [i++] >= key)
                        break;
                if (arr [i++] >= key)
                        break;
        }
        return --i;
}
Mark Ransom
  • Correct in principle, but incorrect in detail. The sentinel must be greater or equal to the key, not less. – Mark Probst Apr 30 '10 at 02:10
  • Took a few edits to get these right, sorry if anybody is confused. – Mark Ransom Apr 30 '10 at 02:10
  • Also, the assert is incorrect, apart from the sign. The element after the last one has index n, not n+1. – Mark Probst Apr 30 '10 at 02:12
  • @Mark, thanks for spotting n+1, I guess I'm not done editing. And I think you're right about the sentinel too, which is how I had it first - I'm trying to do this too fast. – Mark Ransom Apr 30 '10 at 02:14
  • As long as we're micro-optimizing, you may as well start `i` at `-1` and then pre-increment it in the array indexing in your unrolled loop example. That will save you the additional `--i` at the end. – mbauman Apr 30 '10 at 17:16
  • @Matt, since the decrement is outside the loop it won't make a measurable difference in the results. I would also need to change all the postincrements to preincrements. Thanks for the suggestion though. – Mark Ransom Apr 30 '10 at 21:52
  • @Mark Ransom: Why unrolling 4 times? Why not 2 or 8 or 16? – Lazer Jun 13 '10 at 10:41
  • @Lazer, you could certainly do that but there's a point of diminishing returns. Also the larger the code, the more possibility of cache problems. – Mark Ransom Jun 13 '10 at 14:51
  • 1
    @Mark Ransom: yes, I do understand that, but how did you arrive at 4 anyways? Also, I am not sure about the "of course" part of ["For n < 4 this unrolling will not speed up search at all, of course"](http://schani.wordpress.com/2010/04/30/linear-vs-binary-search/). – Lazer Jun 13 '10 at 17:16
4

If a target-specific solution is acceptable then you can quite easily use SIMD (SSE, AltiVec, or whatever you have available) to get ~ 4x speed-up by testing 4 elements at a time rather than just 1.

Out of interest I put together a simple SIMD implementation as follows:

int linear_search_ref(const int32_t *A, int32_t key, int n)
{
    int result = -1;
    int i;

    for (i = 0; i < n; ++i)
    {
        if (A[i] >= key)
        {
            result = i;
            break;
        }
    }
    return result;
}

int linear_search(const int32_t *A, int32_t key, int n)
{
#define VEC_INT_ELEMS 4
#define BLOCK_SIZE (VEC_INT_ELEMS * 32)
    const __m128i vkey = _mm_set1_epi32(key);
    int vresult = -1;
    int result = -1;
    int i, j;

    for (i = 0; i <= n - BLOCK_SIZE; i += BLOCK_SIZE)
    {
        __m128i vmask0 = _mm_set1_epi32(-1);
        __m128i vmask1 = _mm_set1_epi32(-1);
        int mask0, mask1;

        for (j = 0; j < BLOCK_SIZE; j += VEC_INT_ELEMS * 2)
        {
            __m128i vA0 = _mm_load_si128((const __m128i *)&A[i + j]);
            __m128i vA1 = _mm_load_si128((const __m128i *)&A[i + j + VEC_INT_ELEMS]);
            __m128i vcmp0 = _mm_cmpgt_epi32(vkey, vA0);
            __m128i vcmp1 = _mm_cmpgt_epi32(vkey, vA1);
            vmask0 = _mm_and_si128(vmask0, vcmp0);
            vmask1 = _mm_and_si128(vmask1, vcmp1);
        }
        mask0 = _mm_movemask_epi8(vmask0);
        mask1 = _mm_movemask_epi8(vmask1);
        if ((mask0 & mask1) != 0xffff)
        {
            vresult = i;
            break;
        }
    }
    if (vresult > -1)
    {
        result = vresult + linear_search_ref(&A[vresult], key, BLOCK_SIZE);
    }
    else if (i < n)
    {
        result = i + linear_search_ref(&A[i], key, n - i);
    }
    return result;
#undef BLOCK_SIZE
#undef VEC_INT_ELEMS
}

On a 2.67 GHz Core i7, using OpenSUSE x86-64 and gcc 4.3.2, I get around 7x - 8x improvement around a fairly broad "sweet spot" where n = 100000 with the key being found at the midpoint of the array (i.e. result = n / 2). Performance drops off to around 3.5x when n gets large and the array therefore exceeds cache size (presumably becoming memory bandwidth-limited in this case). Performance also drops off when n is small, due to inefficiency of the SIMD implementation (it was optimised for large n of course).

Paul R
  • You can use SIMD, but the speedup will not be 4x, especially not for small arrays. Tested with SSE2 on a Core i7. I'd be interested in your implementation. – Mark Probst Apr 30 '10 at 11:09
  • 1
    For small arrays, perhaps not, but for large arrays I think you should be able to hit 4x using SIMD. I would unroll the main loop by 2 so that you have two vector loads issued per iteration and you should then be able to hide most of the latencies. – Paul R Apr 30 '10 at 13:18
  • I've spent quite some time fiddling with this, and the best speedup I can get with SSE2 over my best non-SSE2 implementation is 2.6x for large arrays. I'd be happy to test your implementation, though :-) – Mark Probst Apr 30 '10 at 15:04
  • For large buffers, around 2.5x is the typical speedup I've seen for carefully optimized sse2 code over carefully optimized straight c math. – Alan Apr 30 '10 at 17:13
  • 2
    @Alan: it depends a great deal on what CPU you are using, and also some extent on what compiler. Prior to Woodcrest when SSE2 was only a 64 bit implementation under the hood, SSE speed-ups were modest compared to full 128 bit SIMD implementations such as AltiVec, but from Core 2 Duo onwards you should be able to get 4x improvement for float/int. – Paul R Apr 30 '10 at 19:06
  • @Mark Probst: OK, I've added a simple SIMD implementation to my answer above. It's around `8x` faster than scalar code at best, with array sizes of the order of 100000 and the key value being found half way through the array. It drops off to around `3.5x` for very large arrays. – Paul R May 01 '10 at 17:17
  • For small array sizes this is much slower than my fastest implementations (and slower, actually, than my slowest). For N=1000 it's about as fast as my fastest non-SIMD implementation, but still not even half the speed of my best SSE2 implementation. At N=10000 it's almost as fast as my best SSE2 implementation, but it never catches up fully. GCC 4.2.1 on a Core i7. – Mark Probst May 01 '10 at 20:10
  • 1
    @Mark: I wonder how you're compiling it, and also how you're timing it ? I'm using `gcc -O3` and it's an x86-64 executable (twice as many SSE registers as x86). When I time it I'm doing it in a loop (100 iterations) and taking the minimum time - this means that for all but the first iteration the caches will be primed. If you're just timing one iteration then your measurements will be skewed. And yes, of course the performance will be poor for small arrays - that is expected since the routine evaluates blocks of the array rather than individual elements or vectors. – Paul R May 02 '10 at 07:27
  • @Mark: just one further sanity check: what are your absolute times ? At 2.67 GHz I'm seeing around 1.0 ns / element searched for the scalar code and under 0.15 ns / element searched for my SIMD code (both for the N = 100000 case, where the key is at N / 2, hence times are equal to total time / 50000). – Paul R May 02 '10 at 10:07
  • 0.15ns / element searched means you're doing 2.5 elements/cycle. My code is doing 2.6 elements/cycle. Feel free to try it out yourself http://github.com/schani/linbin - I made a branch "paulr" that includes your implementation. Test linear_sentinel_sse2_nobranch vs linear_sse2_paulr. – Mark Probst May 02 '10 at 11:00
  • @Mark: Thanks - I have some more ideas for further optimisations which I'll try out later today. One involves using SSE4 but I'll #ifdef that code in case you want to restrict this to SSE2. – Paul R May 02 '10 at 13:54
  • @Mark: well my further optimisations didn't help much - I suspect that performance has reached the point where it is limited by cache/memory read bandwidth. This is always the problem when you have very little computation relative to memory I/O. – Paul R May 02 '10 at 15:22
  • 1
    Could well be. Good for us, then, isn't it? :-) – Mark Probst May 02 '10 at 15:25
2

If you had a quantum computer, you could use Grover's algorithm to search your data in O(√N) time using O(log N) storage space. Otherwise, your question is pretty silly. Binary search or one of its variants (ternary search, for example) is really your best choice. Doing micro-optimizations on a linear search is stupid when you can pick a superior algorithm.

George
  • 3
    Ok, Mister Smarty-Pants, I have a Core i7 and need to search in an array of size 64, and it needs to be super-fast. Linear or binary? Any further optimizations? – Mark Probst Apr 30 '10 at 02:56
  • 3
    George: For small arrays, cache misses and branch mispredictions will dominate the time to run a binary search. A linear search can use prefetch to eliminate cache misses and will be able to predict most branches. – Gabe Apr 30 '10 at 03:41
  • 1
    Yes, you can do almost everything in constant time, if you just make the constant large enough. But that was not the question. – Mark Probst Apr 30 '10 at 11:07
  • 3
    In theory, a fixed size array is searched in constant time. In theory, there is no difference between theory and practice. In practice, that is not true. – Alan Apr 30 '10 at 17:05
  • @Mark: True, but if you're searching an array of size 64 why go through all this effort to optimize the search? – BlueRaja - Danny Pflughoeft Apr 30 '10 at 20:42
  • 1
    I could ask the same question for any array size, couldn't I? – Mark Probst May 02 '10 at 00:15
2

If you're on an Intel platform:

int linear (const int *array, int n, int key)
{
  __asm
  {
    mov edi,array
    mov ecx,n
    mov eax,key
    repne scasd
    mov eax,-1
    jne end
    mov eax,n
    sub eax,ecx
    dec eax
end:
  }
}

but that only finds exact matches, not greater than or equal matches.

In C, you can also use Duff's Device:

int linear (const int *array, int n, int key)
{
  const int
    *end = &array [n];

  int
    result = 0;

  switch (n % 8)
  {
    do {
  case 0:
    if (*(array++) >= key) break;
    ++result;
  case 7:
    if (*(array++) >= key) break;
    ++result;
  case 6:
    if (*(array++) >= key) break;
    ++result;
  case 5:
    if (*(array++) >= key) break;
    ++result;
  case 4:
    if (*(array++) >= key) break;
    ++result;
  case 3:
    if (*(array++) >= key) break;
    ++result;
  case 2:
    if (*(array++) >= key) break;
    ++result;
  case 1:
    if (*(array++) >= key) break;
    ++result;
    } while(array < end);
  }

  return result;
}
Skizz
  • 3
    Be careful recommending Duff's device. It's clever C code, for some value of "clever", but because it's extremely unstructured, it can sometimes defeat modern optimizing compilers. – Dale Hagglund Apr 30 '10 at 09:13
  • 1
    @Dale: You're right, modern compilers almost certainly would do a better job of loop unrolling than this. – Skizz Apr 30 '10 at 20:25
  • `repne scasd` has significant startup overhead, and isn't even all that fast compared to SIMD. (`rep stos` and `rep movs` are good (especially for large blocks to amortize their startup overhead), and internally operate in 16-byte or 32-byte chunks, but AFAIK the conditional rep-string instructions (scas and cmps) aren't much more than a scalar loop implemented in microcode.) See [Agner Fog's insn tables and Optimizing Assembly guide](http://www.agner.org/optimize/), and also other links in the [x86 tag wiki](http://stackoverflow.com/tags/x86/info), like Intel's optimization manual. – Peter Cordes Sep 20 '16 at 22:29
  • Update on this: `repne scasd` does *not* have Fast Strings support on any existing CPUs. It does at best 1 DWORD compare per clock after startup, even on recent Skylake / Ryzen CPUs. In 2010 when this answer was posted, Nehalem was current and could do one 16-byte SIMD load per clock. Intel since Haswell, and AMD since Zen2, can do 2x 32-byte loads per clock, along with the SIMD ALU work to compare and check for the key. (Or stgatilov's branchless version just counts to find where the key was). Going to have to downvote this: it's not optimal for anything ever, except possibly code-size. – Peter Cordes Aug 08 '19 at 08:48
2

You can do it in parallel.

If the list is small, it may not be worth splitting a single search, but if you have to process lots of searches, then you can definitely run them in parallel. That wouldn't reduce the latency of an individual search, but it would improve the throughput.
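For illustration, a minimal pthreads sketch of batching many independent searches across worker threads (the thread count, batch split, and names like `parallel_searches` are illustrative, not a tuned implementation):

```c
#include <pthread.h>

/* one worker's slice of the search keys */
struct batch {
    const int *arr;   /* sorted array shared by all workers */
    int        n;
    const int *keys;  /* this worker's keys */
    int       *out;   /* results: first index with arr[i] >= key */
    int        count;
};

static int linear(const int *arr, int n, int key)
{
    int i = 0;
    while (i < n && arr[i] < key)
        ++i;
    return i;
}

static void *search_worker(void *p)
{
    struct batch *b = p;
    for (int k = 0; k < b->count; ++k)
        b->out[k] = linear(b->arr, b->n, b->keys[k]);
    return NULL;
}

/* run nkeys searches split across nthreads workers (nthreads <= 16) */
static void parallel_searches(const int *arr, int n, const int *keys,
                              int *out, int nkeys, int nthreads)
{
    pthread_t    tid[16];
    struct batch b[16];
    int per = (nkeys + nthreads - 1) / nthreads;
    for (int t = 0; t < nthreads; ++t) {
        int lo = t * per, hi = lo + per;
        if (hi > nkeys) hi = nkeys;
        if (lo > nkeys) lo = nkeys;
        b[t] = (struct batch){ arr, n, keys + lo, out + lo, hi - lo };
        pthread_create(&tid[t], NULL, search_worker, &b[t]);
    }
    for (int t = 0; t < nthreads; ++t)
        pthread_join(tid[t], NULL);
}
```

In practice you would keep the threads in a pool rather than creating them per batch, as discussed below.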

fortran
  • 74,053
  • 25
  • 135
  • 175
  • 2
    There's almost no way that creating even one thread will be cheaper than a linear scan of 100-200 items. – Dale Hagglund Apr 30 '10 at 09:23
  • Still, if there are going to be lots of searches, those can be done in parallel, and the threads can be in a pool and reused. – fortran Apr 30 '10 at 10:25
  • 1
    In this case, if you are searching <60 items, there is no need for doing it in parallel. However, there are some use cases (I have one now) where an Array of items is not ordered and the order cannot be changed. Binary search cannot be used in this case and if the size of the Array is rather large (it would have to be somewhere around 10,000 to make it worth the extra effort), dividing the array and searching in parallel would definitely be a viable solution – Richard Sep 09 '10 at 11:17
  • Yup, for large arrays you could imagine that different parts of the array can stay hot in the private L2 cache on different cores. For a 64 element array, the synchronization overhead from dispatching a search to a worker thread is higher than just doing it in the thread that wants the result. – Peter Cordes Sep 20 '16 at 22:33
2

You've received many suggestions for improvements, but you need to measure each optimization to see which is best given your hardware and compiler.

As an example of this, in the first version of this response, I guessed that by 100-200 array elements, the slightly higher overhead of binary search should easily be paid for by far fewer probes into the array. However, in the comments below, Mark Probst reports that he sees linear search ahead up to about 500 entries on his hardware. This reinforces the need to measure when searching for the very best performance.

Note: Edited following Mark's comments below on his measurements of linear versus binary search for reasonably small N.

Dale Hagglund
  • 16,074
  • 4
  • 30
  • 37
2

I know that this topic is old, but I could not keep myself from posting. My optimization for a sentinel linear search is:

int sentinel_linear_search(int key, int *arr, int n)
{
    int last_value, i;

    /* considering that n is the real size of the array */
    if (--n < 0)
        return -1;

    last_value = arr[n];

    /* set array last member as the key */
    arr[n] = key;

    i = 0;
    while (arr[i] != key)
        ++i;

    /* recover the real array last member */
    arr[n] = last_value;

    return (arr[i] == key) ? i : -1;
}

The great improvement of the sentinel search is that each iteration uses only one conditional branch (on the key) instead of two (index bound and key).

    while (arr[i] != key)
        ++i;
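For the question's "first element greater than or equal to the key" semantics, a hedged adaptation of the same sentinel idea (the caller must reserve one writable slot at index n, as the question allows; `sentinel_lower_bound` is an illustrative name):

```c
/* First index with arr[i] >= key; arr[n] must be a writable spare slot. */
static int sentinel_lower_bound(int *arr, int n, int key)
{
    int saved = arr[n];
    arr[n] = key;            /* sentinel guarantees the loop terminates */
    int i = 0;
    while (arr[i] < key)
        ++i;
    arr[n] = saved;          /* restore the spare slot */
    return i;                /* i == n means no element is >= key */
}
```

Note the array can no longer be const, since the sentinel is written into it.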
Geyslan Gregório
  • 1,112
  • 11
  • 13
1

Unroll with fixed array indices.

int linear( const int *array, int n, int key ) {
  int i = 0;
  if ( array[n-1] >= key ) {
     do {
       if ( array[0] >= key ) return i+0;
       if ( array[1] >= key ) return i+1;
       if ( array[2] >= key ) return i+2;
       if ( array[3] >= key ) return i+3;
       array += 4;
       i += 4;
     } while ( true );
  }
  return -1;
}
drawnonward
  • 53,459
  • 16
  • 107
  • 112
1

This answer is a little more obscure than my other one, so I'm posting it separately. It relies on the fact that C guarantees comparison results of 0 for false and 1 for true. x86 can produce booleans without branching, so it might be faster, but I haven't tested it. Micro-optimizations like these will always be highly dependent on your processor and compiler.

As before, the caller is responsible for putting a sentinel value at the end of the array to ensure that the loop terminates.

Determining the optimum amount of loop unrolling takes some experimentation. You want to find the point of diminishing (or negative) returns. I'm going to take a SWAG and try 8 this time.

static int
linear (const int *arr, int n, int key)
{
        assert(arr[n] >= key);
        int i = 0;
        while (arr[i] < key) {
                i += (arr[i] < key);
                i += (arr[i] < key);
                i += (arr[i] < key);
                i += (arr[i] < key);
                i += (arr[i] < key);
                i += (arr[i] < key);
                i += (arr[i] < key);
                i += (arr[i] < key);
        }
        return i;
}

Edit: As Mark points out, this function introduces a dependency in each line on the line preceding it, which limits the processor pipeline's ability to run operations in parallel. So let's try a small modification to the function to remove the dependency. Now the function does indeed require 8 sentinel elements at the end.

static int 
linear (const int *arr, int n, int key) 
{ 
        assert(arr[n] >= key);
        assert(arr[n+7] >= key);
        int i = 0; 
        while (arr[i] < key) {
                int j = i;
                i += (arr[j] < key); 
                i += (arr[j+1] < key); 
                i += (arr[j+2] < key); 
                i += (arr[j+3] < key); 
                i += (arr[j+4] < key); 
                i += (arr[j+5] < key); 
                i += (arr[j+6] < key); 
                i += (arr[j+7] < key); 
        }
        return i;
}
Mark Ransom
  • 299,747
  • 42
  • 398
  • 622
  • Good one, but I don't think it will perform well because it introduces data dependency for the index i, which the more straightforward linear search does not. I'll benchmark it. Also, you need 8 sentinel values, not just one. – Mark Probst Apr 30 '10 at 12:58
  • The data's in - it performs atrociously :-). It's beaten even by a straightforward, non-sentinel, non-unrolled linear search by almost a factor of 2. – Mark Probst Apr 30 '10 at 13:08
  • Oh well, it was worth a shot. And you still only need one sentinel, because the index stops incrementing as soon as you reach it. – Mark Ransom Apr 30 '10 at 13:28
  • @Mark Probst, try my newest wrinkle. – Mark Ransom Apr 30 '10 at 16:17
  • Much better. About 30% faster than the bog-standard linear search, but still only about half the speed of unrolled linear search with sentinel. My code is now online at http://github.com/schani/linbin/ - feel free to play around with it. – Mark Probst Apr 30 '10 at 16:49
1

You could avoid the n bound checks, similar to how loop unrolling does it, by using a sentinel (note that array can no longer be const, since the sentinel is written into it):

static int linear(int *array, int arraySize, int key)
{
  //assuming the caller reserves one extra slot at index arraySize
  array[arraySize] = key; 

  int i = 0;
  for (; ; ++i)
  {
     if (array[i] == key) return i;
  }
}
Sevki
  • 3,592
  • 4
  • 30
  • 53
archon
  • 11
  • 1
  • If there is no element similar to key you'll read out of bounds. To use one branch conditional it's necessary to set the last (or first if inverse) array element. See my answer: http://stackoverflow.com/a/33972674/2776344 – Geyslan Gregório Nov 28 '15 at 15:27
0

Loop backwards; this might be translated...

// loop backward

for (int i = arraySize - 1; i >=0; --i)

...to this ("could be" faster):

    mov cx, arraySize - 1
detectionHere:
    ...
    loop detectionHere   

Other than that, only binary search can make searching faster.

Michael Buen
  • 38,643
  • 9
  • 94
  • 118
0

This might force vector instructions (suggested by GMan):

for (int i = 0; i < N; i += 4) {
    bool found = false;   
    found |= (array[i+0] >= key);
    ...
    found |= ( array[i+3] >= key);
    // slight variation would be to use max intrinsic
    if (found) return i;
}
...
// quick search among four elements

This also produces fewer branch instructions. You may help by ensuring the input array is aligned to a 16-byte boundary.

another thing that may help vectorization (doing vertical max comparison):

for (int i = 0; i < N; i += 8) {
    bool found = false;   
    found |= max(array[i+0], array[i+4]) >= key;
    ...
    found |= max(array[i+3], array[i+7]) >= key;
    if (found) return i;
}
// have to search eight elements
Anycorn
  • 50,217
  • 42
  • 167
  • 261
  • @the_drow basically, you hoping to use vector instructions to do 4x things in one time. many compilers can be forced to use such instructions. in the first, you loading 4 elements, in the second, you loading eight elements, and eliminate half by using vector max function. the result is range in which index is located (four or eight elements long).after this you have to search small range for exact index – Anycorn Apr 30 '10 at 03:22
0

You could search for a larger unit than an int at a time. Platform-specifically, this can be much faster or slower depending on how the larger data read is handled. For instance, on a 64-bit system, reading the array 2 elements at a time and checking the high/low halves separately could run faster due to fewer memory reads. Still, this is an O(n) search no matter what.
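A hedged sketch of that idea on a 64-bit little-endian platform (the `memcpy` should compile to a single 8-byte load; `linear_pairs` is an illustrative name, and n is assumed even, which the question's sentinel padding permits):

```c
#include <stdint.h>
#include <string.h>

/* Scan two 32-bit elements per 64-bit load; little-endian assumed. */
static int linear_pairs(const int *arr, int n, int key)
{
    for (int i = 0; i < n; i += 2) {
        uint64_t pair;
        memcpy(&pair, &arr[i], sizeof pair);   /* one 8-byte load */
        int lo = (int)(uint32_t)pair;          /* arr[i]     */
        int hi = (int)(uint32_t)(pair >> 32);  /* arr[i + 1] */
        if (lo >= key) return i;
        if (hi >= key) return i + 1;
    }
    return n;
}
```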

Michael Dorgan
  • 12,453
  • 3
  • 31
  • 61
0

In one of the comments you said the array length is 64.

Well if you must do it linearly, you can do:

int i = -1;
do {
  if (arr[0] >= key){i = 0; break;}
  if (arr[1] >= key){i = 1; break;}
  ...
  if (arr[62] >= key){i = 62; break;}
  if (arr[63] >= key){i = 63; break;}
} while(0);

However, I seriously doubt if that is faster than this binary search: *

int i = 0;
if (key >= arr[i+32]) i += 32;
if (key >= arr[i+16]) i += 16;
if (key >= arr[i+ 8]) i +=  8;
if (key >= arr[i+ 4]) i +=  4;
if (key >= arr[i+ 2]) i +=  2;
if (key >= arr[i+ 1]) i +=  1;

*Thanks to Jon Bentley for that one.
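For the question's "first element >= key" semantics, the chain above can be adapted as follows (a hedged sketch for exactly 64 elements; `lower_bound64` is an illustrative name, and it returns 64 when no element qualifies):

```c
/* First index with arr[i] >= key in a sorted 64-element array. */
static int lower_bound64(const int *arr, int key)
{
    if (arr[0] >= key)
        return 0;
    int i = 0;                        /* invariant: arr[i] < key */
    if (arr[i + 32] < key) i += 32;
    if (arr[i + 16] < key) i += 16;
    if (arr[i +  8] < key) i +=  8;
    if (arr[i +  4] < key) i +=  4;
    if (arr[i +  2] < key) i +=  2;
    if (arr[i +  1] < key) i +=  1;
    return i + 1;                     /* one past the last arr[i] < key */
}
```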

Added: since you said this table is prepared once for a large number of searches, and you want fast, you could allocate some space somewhere and generate machine code with the values hard-wired into it. It could either be linear or binary search. If binary, the machine code would look like what the compiler would generate from this:

if (key < value32){
  if (key < value16){
    ...
  }
  else {
    ...
  }
}
else {
  if (key < value48){
    ...
  }
  else {
    ...
  }
}

Then you just copy that into a place where you can call it.

OR you could print the code above, compile and link it on the fly into a dll, and load the dll.

Mike Dunlavey
  • 40,059
  • 14
  • 91
  • 135
0
uint32 LinearFindSse4( uint8* data, size_t data_len, uint8* finddata, size_t finddatalen )
{
    /**
     * the following is based on...
     * #define haszero(v) (((v) - 0x01010101UL) & ~(v) & 0x80808080UL)
     * we split it into 2 sections
     * first section is:
     * (v) - 0x01010101UL)
     *
     * second section is:
     * ~(v) & 0x80808080UL)
     */
    __m128i ones = _mm_set1_epi8( 0x01 );
    __m128i eights = _mm_set1_epi8( 0x80 );
    __m128i find_field = _mm_set1_epi8( finddata[0] );

    uint32 found_at = 0;
    for (size_t i = 0; i < data_len; i += 16)
    {
#define CHECKTHIS( n ) if (!memcmp(&data[i+n], &finddata[0], finddatalen)) { found_at = i + n; break; }

        /* _mm_stream_load_si128 (SSE4.1) requires 16-byte-aligned data */
        __m128i chunk = _mm_stream_load_si128( (__m128i *)&data[i] );
        __m128i xor_result = _mm_xor_si128( chunk, find_field );
        __m128i first_sec = _mm_sub_epi64( xor_result, ones );
        __m128i second_sec = _mm_andnot_si128( xor_result, eights );

        if(!_mm_testz_si128(first_sec, second_sec))
        {
            CHECKTHIS(0);
            CHECKTHIS(1);
            CHECKTHIS(2);
            CHECKTHIS(3);
            CHECKTHIS(4);
            CHECKTHIS(5);
            CHECKTHIS(6);
            CHECKTHIS(7);
            CHECKTHIS(8);
            CHECKTHIS(9);
            CHECKTHIS(10);
            CHECKTHIS(11);
            CHECKTHIS(12);
            CHECKTHIS(13);
            CHECKTHIS(14);
            CHECKTHIS(15);
        }
    }
    return found_at;
}
-1

In reality, the answer to this question is 100% dependent on the platform you're writing the code for. For example:

CPU : Memory speed | Example CPU | Type of optimisation
========================================================================
     Equal         |    8086     | (1) Loop unrolling
------------------------------------------------------------------------
   CPU > RAM       |  Pentium    | (2) None

1. Avoiding the conditional branch required to loop through the data gives a slight performance improvement.
2. Once the CPU gets faster than the RAM, it doesn't matter how efficient the loop becomes (unless it's a really bad loop): it will stall waiting for data to arrive from RAM. SIMD doesn't really help there, since the advantage of parallel testing is still outweighed by having to wait for more data to arrive. SIMD really comes into its own when you're CPU-limited.
Skizz
  • 69,698
  • 10
  • 71
  • 108
  • The data (http://schani.wordpress.com/2010/04/30/linear-vs-binary-search/) disagrees with your theory of reality. – Mark Probst May 01 '10 at 00:05
  • @Mark: Your method appears to eliminate RAM overhead by throwing away the two slowest times, so you're effectively measuring the algorithm, not the whole system. After a couple of runs, the array will be loaded into L1 and L2 cache and be reasonably quick to access. It would be interesting to see the two slowest times included in the timings - if you could guarantee the data was in RAM and not any cache then the algorithm would have less of an effect on the timings. – Skizz May 01 '10 at 21:31
  • 1
    I'm not throwing away the two slowest individual search times - I can't time a search that takes only a handful of cycles. I do, say, the same 20 million random searches, 10 times over, and throw away the times for the two slowest and the two fastest of those 10 runs. I average the 6 that remain and divide the average time by 20 million to get the average time for one individual search. If you know how to reliably time such a search from RAM, i.e. with "empty" L2 and L3 caches, please let me know. – Mark Probst May 02 '10 at 00:19
  • On a quad-core i7, a single core can nearly saturate DRAM bandwidth. On a typical Haswell or Skylake, that's something like 8 bytes per core clock cycle so yes you do need SIMD to keep up even with DRAM, let alone L3 cache. In a program where about optimizing this search is worthwhile, it probably runs enough to stay hot in at least L3, probably L2. Wider SIMD means more work in fewer uops so it helps keep more cache misses in flight (the same out-of-order window can "see" more bytes ahead to trigger page walks and cache misses earlier; HW data prefetch usually stops at 4k boundaries.) – Peter Cordes Aug 08 '19 at 08:57
  • I think people have misunderstood my answer. For a linear search, the algorithm is constrained by the speed data can be fetched from RAM (or disk for really big arrays), once you reach peak data transfer rate then improving the algorithm will make little difference to the overall speed. Yes, changing the algorithm could make it faster by reducing the amount of data being moved through the system, but the question did say "only linear search". – Skizz Aug 15 '19 at 10:02
-5

Well, you could use pointers...

static int linear(const int *array, int arraySize, int key) {
    int i;

    for(i = 0; i < arraySize; ++i) {
        if(*array >= key) {
            return i;
        }

        ++array;
    }

    return arraySize;
}
strager
  • 88,763
  • 26
  • 134
  • 176
  • Yeah, but the compiler will probably optimize that bit anyway. You could also try loop unrolling. – BobbyShaftoe Apr 30 '10 at 02:01
  • 1
    Look at the output from your compiler on that one, it's probably the same as the OP's code. (gcc has been doing this optimization since <2.95, which is where I first noticed it.) The "counter" variable will be initialized to n and each time through the loop the counter is decremented while the pointer is advanced by 4 (or whatever sizeof(int) returns). – dash-tom-bang Apr 30 '10 at 02:01
  • I don't think this helps at all. It just means you're incrementing an extra variable every loop. Unless dereferencing a pointer is faster than array[i]... – Chris Cooper Apr 30 '10 at 02:01
  • 1
    @Shaftoe, Yes; this kind of microoptimization I have a hard time doing with a clean conscience. – strager Apr 30 '10 at 02:03
  • @GMan: Just about any compiler that offers code optimizations will reduce the counter + array index to pointer arithmetic in the generated code. – dthorpe Apr 30 '10 at 02:04
  • Last time I tested on X86, array[i] was faster than *array because you only have to do one increment instead of two. – Mark Ransom Apr 30 '10 at 02:04
  • This will make your code slower at worst, because you're doing one extra increment per iteration. If the compiler's smart, it will give the same performance as the original code. – Mark Probst Apr 30 '10 at 02:08
  • @GMan - The code has changed since I made my comment, so it is no longer applicable. – Chris Cooper Apr 30 '10 at 03:10
  • @GMan - Unless it took me a while to actually write the comment I assume...? Anyway, other people have said similar things. I re-read the post, and my comment IS still applicable. But I agree with Romain... Too much hair-splitting here. =P – Chris Cooper Apr 30 '10 at 04:28
  • @GMan - Agreed. And regardless of if it's changed or not, the change I testify to was simply a moving of the "++array" from the for loop () to the end of the loop, which has no function difference. =P – Chris Cooper Apr 30 '10 at 05:16
  • If you want to help the compiler optimize away a separate integer inside the loop, remove the `i` variable. Use `return p - array` to calculate the length from a pointer subtraction if you want to actually hand-hold the compiler into making a tighter inner loop. Unless you show compiler output showing that this makes a nicer inner loop even though you still have an `i++` as well as `array++`. – Peter Cordes Aug 08 '19 at 09:01