When working with a two-dimensional array whose width is a power of two, there are several ways to calculate the index for an (x, y) pair:

Given

size_t width, height, stride_shift;
auto array = std::vector<T>(width*height);
size_t x, y;

assert(width == ((size_t)1 << stride_shift));
assert(x < width);
assert(y < height);

1st way - Addition

size_t index = x + (y << stride_shift);

2nd way - Bitwise OR

size_t index = x | (y << stride_shift);

3rd way - Bitwise XOR

size_t index = x ^ (y << stride_shift);

The index is then used like this:

auto& ref = array[index];
...

All three yield the same result, because a+b, a|b and a^b produce the same value as long as a and b have no set bit in common (i.e. a & b == 0). That is the case here because the index is just a concatenation of the bits of y and x:


                    lowest  bits
                     v        v
index = 0b...yyyyyyyyyxxxxxxxxx
                      \___ ___/
                          '
                  stride_shift bits

So the question is: what is potentially faster on x86(-64), and what is recommended for readable code?

  • Could there be a speedup from using +, because the compiler might use both the ADD and LEA instructions, so that multiple additions can be performed at the same time on both the address-generation unit and the ALU?
  • Could OR and XOR be faster, because every result bit depends only on the corresponding input bits, with no carry propagating from lower bits?
  • Consider a nested for loop with y iterated in the outer loop. Are there compilers that hoist the row computation, calculating row_pointer = array.data() + (y << stride_shift) once per iteration of y and then pointer = row_pointer + x in the inner loop? I think this optimization wouldn't be possible using | or ^.
  • Mixing bitwise and arithmetic operations is sometimes considered bad style, so should one choose | to match the <<?
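
The hoisting described in the third bullet can be written out by hand; this is a sketch following the question's setup (fill and its body are hypothetical):

```cpp
#include <cstddef>
#include <vector>

void fill(std::vector<int>& array, std::size_t width, std::size_t height,
          std::size_t stride_shift) {
    for (std::size_t y = 0; y < height; ++y) {
        // Computed once per row. With | or ^ the compiler could not, in
        // general, fold x into this base pointer, because row_pointer[x]
        // means row_pointer + x, i.e. an addition.
        int* row_pointer = array.data() + (y << stride_shift);
        for (std::size_t x = 0; x < width; ++x) {
            row_pointer[x] = static_cast<int>(x + y);
        }
    }
}
```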

As a bonus: How does this apply in general/to other architectures?

njuffa
cmdLP
  • Have you benchmarked it? What was the result? – Alan Birtles Apr 11 '21 at 18:38
  • `LEA` also provides the benefit of leaving the output in a third register instead of needing an additional `MOV`, so that's another advantage of `+`. Plus if `stride_shift` happens to be a compile-time constant with value 1, 2 or 3 (scale factor 2, 4 or 8), the shift can be done in the `LEA` as well. – Nate Eldredge Apr 11 '21 at 18:45
  • Are you aware of [Compiler Explorer](https://godbolt.org)? Recommended way to explore code generation questions with commonly used tool chains (make sure you try Clang). Are you concerned about energy efficiency at all? `OR` and `XOR` are likely to require less energy than `ADD`. – njuffa Apr 11 '21 at 18:47
  • BTW: just because the instruction is called `lea`, does not mean an AGU does the computation. That is one possible implementation, but on all modern x86 `lea` competes with normal ALU instructions (of course, modern x86 can execute plenty of *those* in parallel, so it's no big deal). `lea` is useful, but you shouldn't use "it's computed on the AGU" as a reason to use it. – harold Apr 11 '21 at 18:48
  • Although one might imagine that `OR/XOR` need less circuitry than `ADD` and could therefore be faster, in practice I think they will all be single-cycle uops on any modern CPU. – Nate Eldredge Apr 11 '21 at 18:48
  • `lea` will only be useful if `stride_shift` is `1`, `2` or `3` (scale factor 2, 4 or 8). On ARM, all 3 of `+`, `^` and `|` can use a pre-shifted argument, so all 3 could be done in a single instruction. If you're iterating over all elements of your 2-dimensional array, you're better off just adding the inner stride increment every time, so you need fewer registers, only add an immediate in every step, get better caching and prefetching. – EOF Apr 11 '21 at 20:27

1 Answer

I would expect no difference between the 3 versions: on modern x86 architectures +, | and ^ all have an identical latency of 1 cycle and throughput of 0.5 cycles (including the LEA r+r*s variant), and all are executed on the ALU.

So I would concentrate on code readability instead.

x + (y << stride_shift) definitely looks more readable (perhaps even better: x + y * width, which should get optimized into a shift).
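
As a sketch of that multiply form (width here is a hypothetical compile-time constant; most optimizers strength-reduce the multiply to a shift when it is a known power of two):

```cpp
#include <cstddef>

// Hypothetical constant; a power of two, so y * width == y << 8.
constexpr std::size_t width = 256;

std::size_t index_of(std::size_t x, std::size_t y) {
    return x + y * width;  // typically compiled as x + (y << 8)
}
```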

But just to be sure, we can benchmark it real quick.

Test results from quick-bench (Clang 11 results; GCC is acting up at the moment):

The apparent speedup in the + version seems to be due to one less MOV, thanks to LEA writing its result to a third register (see the godbolt link). But that's probably just an artifact of the contrived test; in real-life code there would be plenty of room for the optimizer to hide this latency.

But anyway, it seems the + version wins either way.

rustyx
  • `add`, `xor` and `and` instructions all have the same 4/clock throughput and 1 cycle latency. `lea` has 2/clock throughput on Intel before Ice Lake. On Zen and Ice Lake, 4/clock throughput for addressing modes with scale-factor = 1, otherwise only 2/clock. (Before Ice Lake, e.g. Skylake, a "slow lea" was one with 3 components, two `+` operations, and has 1/clock throughput, 3c latency.) [Assembly why lea is fast?](https://stackoverflow.com/a/67038018) – Peter Cordes Apr 11 '21 at 20:35
  • But a C `+` operator can be done as part of an addressing mode instead of with a separate instruction if you're actually dereferencing the result. Or with LEA to be non-destructive. That's the real key - C compilers don't naively transliterate operators to instructions. Note that `mov reg,reg` has zero latency thanks to register renaming (AMD since Zen, Intel from IvyBridge until pre-Ice Lake: ICL disabled mov-elimination to mitigate an erratum); the cost is in throughput for the front-end to get it into the pipeline. – Peter Cordes Apr 11 '21 at 20:35
  • Worth noting the extra `mov` is dependent on arch. If [BMI2](https://en.wikipedia.org/wiki/Bit_manipulation_instruction_set) is available it will use `shlx`, which avoids the need for a `mov`. See [godbolt link](https://gcc.godbolt.org/z/GPvEWzfes). – Noah Apr 12 '21 at 21:27