
This question is about the efficiency of a linear search vs. a binary search on a pre-sorted array in contiguous storage...

I have an application written in Fortran (77!). One frequent operation for my part of the code is to find the index i in a sorted array such that xin falls between gx(i) and gx(i+1); the code below assumes gx is sorted in decreasing order, so it looks for the i with gx(i+1) < xin <= gx(i). I've currently implemented this as a binary search -- sorry for the statement labels and goto -- I've commented what the equivalent statements would be using Fortran 90...

        i=1
        ih=nx/2
201     continue  !do while (.true.)
           if((xin.le.gx(i)).and.(xin.gt.gx(i+1)))then  !found what we want
              ilow=i+1; ihigh=i
              s1=(gx(ihigh)-xin)/(gx(ihigh)-gx(ilow))
              s2=1.0-s1
              return
           endif
           if(i.ge.ih)then
              goto 202 !exit
           endif
           if(xin.le.gx(ih))then !xin is in second half of array
              i=ih
              ih=nx-(nx-ih)/2
           else !xin is in first half of array
              i=i+1
              ih=i+(ih-i)/2
           endif
        goto 201  !enddo
202     continue
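
For reference, the linear search I'd compare it against is just the obvious scan over the same descending gx, computing the same s1/s2 weights (a sketch):

        do 301 i=1,nx-1
           if((xin.le.gx(i)).and.(xin.gt.gx(i+1)))then !same bracket test
              ilow=i+1; ihigh=i
              s1=(gx(ihigh)-xin)/(gx(ihigh)-gx(ilow))
              s2=1.0-s1
              return
           endif
301     continue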

However, today, I was reading on Wikipedia about binary search and I came across this:

Binary search can interact poorly with the memory hierarchy (i.e. caching), because of its random-access nature. For in-memory searching, if the span to be searched is small, a linear search may have superior performance simply because it exhibits better locality of reference.

I don't completely understand this statement -- my impression was that cache fetches happen in large(ish) chunks at a time, so if we start at the beginning of the array, most of the array would already be in cache (at least as much of it as for a linear search), so I didn't think locality would matter much here.

So my question is: is there any way to tell which algorithm will perform better (linear or binary search)? Is there an array-size boundary? I'm currently using arrays of around 100 elements...

mgilson

1 Answer


For small arrays, the problem is not cache. You are right: A small array is likely to be cached quickly.

The problem is that branch prediction is likely to fail for binary search, because its branches are taken or skipped in a data-dependent, effectively random way. Branch mispredictions stall the CPU pipeline.

This effect can be severe. You can easily search 3 to 8 elements linearly in the time it takes to resolve a single binary-search branch (and a binary search needs multiple branches). The exact break-even point needs to be measured.
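
Measuring it can be as simple as timing a few million lookups. A minimal sketch using the F90 system_clock intrinsic, with a simple linear routine included so it runs as-is; swap your binary version in for linsearch to compare:

        program bench
        !micro-benchmark sketch: average cost of one lookup over a
        !100-element ascending grid
        integer nx,k,idx,isum,c0,c1,rate
        parameter (nx=100)
        real gx(nx),xin
        integer linsearch
        external linsearch
        do 10 k=1,nx
           gx(k)=real(k)  !ascending test grid
10      continue
        isum=0
        call system_clock(c0,rate)
        do 20 k=1,10000000
           !vary xin so the branch outcomes are data-dependent
           xin=1.0+real(mod(k,97))*real(nx-1)/97.0
           idx=linsearch(gx,nx,xin)  !swap in the binary version
           isum=isum+idx  !keep result live
20      continue
        call system_clock(c1)
        print *,'seconds for 1e7 lookups:',real(c1-c0)/real(rate)
        print *,'checksum:',isum
        end

        integer function linsearch(gx,nx,xin)
        !the linear routine under test: largest i with gx(i) <= xin
        !(ascending gx, assumes gx(1) <= xin < gx(nx))
        integer nx,i
        real gx(nx),xin
        do 30 i=1,nx-1
           if(gx(i+1).gt.xin)then
              linsearch=i
              return
           endif
30      continue
        linsearch=nx-1
        return
        end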

Stalling the CPU pipeline is extremely expensive. A Core i7 can retire up to 4 instructions per clock cycle (12 giga-instructions per second at 3 GHz!). But only if it is not stalling.

There are branch-free binary search algorithms that use conditional-move (CMOV) instructions. They basically unroll the search into up to 32 fixed steps and use a CMOV in each step (32 steps are the theoretical maximum for 32-bit indices). They are branch-free but not stall-free: each step depends entirely on the previous one, so the CPU cannot run ahead in the instruction stream and has to wait on each load. So they don't solve the problem, only mitigate it.
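
As a concrete sketch of the idea (not the exact published algorithms): in the question's Fortran, each halving step can be written with the F90 merge intrinsic, which compilers will often, but not always, compile to a conditional move. Written for an ascending gx for simplicity:

        integer function bfind(gx,nx,xin)
        !branch-free binary search sketch: largest i with
        !gx(i) <= xin, or 0 if xin < gx(1).  Ascending gx.
        !Each halving step is a merge() (a conditional select)
        !instead of a data-dependent branch.
        integer nx,base,span,half
        real gx(nx),xin
        base=0
        span=nx
101     if(span.gt.1)then  !do while (span > 1)
           half=span/2
           !skip the first half iff its last element is .le. xin
           base=merge(base+half,base,gx(base+half).le.xin)
           span=span-half
           goto 101
        endif
        bfind=merge(base+1,base,gx(base+1).le.xin) !final test
        return
        end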

usr
  • Thanks for this. It's very informative. I take it from "the exact break even point needs to be measured" that there's no rule of thumb for how to deal with these things (when to search linearly vs. with a binary search)? – mgilson May 10 '12 at 00:29
  • Exactly. It even depends on the CPU. For big lists (or many small lists!), it depends on the cache size of course. I'd just setup a micro-benchmark and measure binary search vs. linear search over N elements. – usr May 10 '12 at 09:04
  • Also modern MMUs will tend to prefetch data in from RAM if they see you accessing the n, n+1, n+2th element in sequence, but can't tell what you're doing if you access in a seemingly random pattern like binary search outputs. – SecurityMatt May 10 '12 at 09:10
  • "branch prediction is impossible" - it's not impossible - it's just wrong 50% of the time :) – SecurityMatt May 10 '12 at 09:11
  • @SecurityMatt, true :) I improved the text a bit. – usr May 10 '12 at 10:05
  • Is there an advantage to searching the array (linearly) at a pre-defined stride (e.g. every 3rd element) and then searching the final (very small) array? This would give a much more regular memory access pattern and cut your searches down by a factor of approx. N (where N is the stride). Or does that have problems as well? (a sketch of this idea appears after these comments) – mgilson May 10 '12 at 12:51
  • Hm what a nice idea. CPUs can detect stride because that is very common. But I guess you'd need to do at least 3 operations before the CPU picks up the pattern. I'd definitely try it though. – usr May 10 '12 at 13:31
  • also sometimes the compiler can vectorize the search -- e.g. in f90 smth like `index=count(gx<=xin)+1` can be quite fast if the elements are short (int/float) -- you may be able to trick your f77 compiler into doing the same. – laxxy May 15 '12 at 00:48
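
Regarding the strided-search idea in the comments above, a minimal, hypothetical sketch of what it could look like (ascending gx, arbitrary stride of 4): a coarse pass at a fixed, prefetch-friendly stride, then a short linear finish.

        integer function sfind(gx,nx,xin)
        !strided search sketch: scan every STRIDE-th element,
        !then finish with a short linear scan.  Ascending gx;
        !assumes gx(1) <= xin, like the question's loop.
        integer nx,i,j,stride
        parameter (stride=4)
        real gx(nx),xin
        i=1
401     if(i+stride.le.nx)then  !coarse pass at fixed stride
           if(gx(i+stride).le.xin)then
              i=i+stride
              goto 401
           endif
        endif
        do 402 j=i,min(i+stride-1,nx-1)  !fine pass
           if(gx(j+1).gt.xin)then
              sfind=j
              return
           endif
402     continue
        sfind=min(i+stride,nx)
        return
        end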