Why is address calculation for array element lengths divisible by powers of 2 more efficient?

Question

I was studying in depth about pointers as I don't think I have good knowledge about pointers and came across the following line on Wikipedia:

When dealing with arrays, the critical lookup operation typically involves a stage called address calculation which involves constructing a pointer to the desired data element in the array. If the data elements in the array have lengths that are divisible by powers of two, this arithmetic is usually much more efficient.

Why is this so?

The above line is written under the heading "Uses"

That should probably have a citation needed flag on wikipedia — norlesh, Jan 01 '14 at 10:43
Multiply isn't really that much slower than shifting nowadays. Unless that statement is referring to the `lea` instruction. But that only does very small powers of two. (at most 4 or 8 I think) — Mysticial, Jan 01 '14 at 10:44
I agree that the difference negligible in state of the art processors. But, apart from those processors, there is a vast variety of computing device such as low end micro controllers that run programs. In most of these devices the difference between a shift and multiplication is significant. — Buddhima Gamlath, Jan 01 '14 at 11:15
Possibly related but with the opposite conclusion - http://stackoverflow.com/questions/11413855/why-is-transposing-a-matrix-of-512x512-much-slower-than-transposing-a-matrix-of/11413856#11413856 — Leeor, Jan 01 '14 at 12:00

Mats Petersson · Accepted Answer · 2014-01-01T11:05:17.857

12

Multiply by 2ⁿ is done by shifting left. Modern processors can do shift in a single cycle (and in x86, for small shifts up to 8 or 16, built into the address calculation itself). A regular multiply operation takes 4-10 clock cycles on AMD64 machines, and most likely similar on modern Intel processors. There are also restrictions to how "close together" two consecutive multiply operations can be done.

Of course, if the size of the array is quite large, it may be more efficient to use multiply instruction and pack the data in more tightly (not using padding to expand the data to a power of 2 size), because of cache efficiency

Of course, modern compilers are clever, so if you need to multiply by X by 12, the compiler will generate (X << 3) + (X << 2), for example, which is faster than a single multiply operation.

edited Jan 01 '14 at 11:05

answered Jan 01 '14 at 10:58

Mats Petersson

126,704
14
140
227

what do you mean with "pack the data in more tightly" ? – user2485710 Jan 01 '14 at 11:02
3

Ever since Core 2, integer multiply on Intel processors have been 3 cycle latency with throughput 1/cycle. Shift is 1 cycle latency and 2/cycle. So yes, shift is still faster. But not by much. – Mysticial Jan 01 '14 at 11:02
@user2485710: I have edited to explain more clearly "pack data in more tightly". Mysticial: Yes, I think the 1/cycle is more of an issue than the latency - particularly in two-dimensional arrays where both the innermost dimension and the element size are non-2^n. – Mats Petersson Jan 01 '14 at 11:07
@Mysticial not according to [this](https://gmplib.org/~tege/x86-timing.pdf), shl has 4 times more throughput and 5 times less latency than imul for 64-bit on core 2 – Esailija Jan 01 '14 at 16:13
@Esailija Then either [Agner Fog](http://www.agner.org/optimize/instruction_tables.pdf) is wrong or GMP is wrong. I got my numbers from Agner Fog's tables. – Mysticial Jan 01 '14 at 20:52
@Mysticial It's not wrong, it lists exactly the same differences. Maybe you are not looking at 64-bit operations? Look at IMUL for r64, r64 – Esailija Jan 02 '14 at 00:25
@Esailija You're right. I was looking at the 32-bit multiply. But it looks like even the 64-bit multiply went down in Nehalem. – Mysticial Jan 02 '14 at 01:50

score 3 · Answer 2 · edited Jan 01 '14 at 10:47

3

The address calculation of i'th element involves base + size_of_element * i.

When the size of element is a power of 2, say size_of_element = 2^m, then this can be achieved with base + (i << m).

The shifting is much more efficient compared to the multiplication involved in the earlier calculation.

edited Jan 01 '14 at 10:47

user207421

305,947
44
307
483

answered Jan 01 '14 at 10:46

Buddhima Gamlath

2,318
1
16
26

3

@ps06756 no, that quote is about element size. – harold Jan 01 '14 at 10:49
@KillanDS 'lea' does the addition as well as the offset calculation. It isn't comparable to either the multiply or the shift. I tested multiply versus shift in the early 1980s and it was ten times as slow. Again, it is not a controversial statement. If you can produce a system on which it isn't less costly, please do so. – user207421 Jan 01 '14 at 11:03
@KillianDS I didn't say you shouldn't *consider* it, I said you shouldn't *compare* it. Apples and oranges. You're in no position to be accusing others of spreading half-truths. I further note that you kept asking for citations while producing precisely none of your own, even when asked. – user207421 Jan 01 '14 at 11:08

score 2 · Answer 3 · answered Jan 01 '14 at 10:45

2

Because the offset can be computed with a left shift instead of multiplication, but I'd also say the remark is probably a decade or two out of date, given the amount of pipelining in CPUs.

answered Jan 01 '14 at 10:45

user207421

305,947
44
307
483

norlesh · Answer 4 · 2014-01-01T10:55:03.253

2

When multiplying such as would have to happen to find the Nth member in the array, when your dealing with powers of two you can use shifting operations which are lest costly than a full blown multiplication operation on some systems.

edited Jan 01 '14 at 10:55

answered Jan 01 '14 at 10:46

norlesh

1,749
11
23

1

I'm just explaining what is written on the wikipedia page as asked - its up to the original author to provide the references – norlesh Jan 01 '14 at 10:52
@KillianDS I disagree. It's hardly a controversial statement. There's a difference between 'less costly' and 'not notice the difference'. – user207421 Jan 01 '14 at 10:56
I have reworded the answer to say 'in some systems' ... how would you answer this question? – norlesh Jan 01 '14 at 10:56
1

Weasel words :) Just cite something, like this: http://instlatx64.atw.hu/ Shift is obviously consistently faster than `imul`. `lea` is at worst as bad as an `imul`, and you'd have to do more than that `imul`, so `lea` still wins. As for whether that actually makes a difference, that's something else. – harold Jan 01 '14 at 10:58
Can I just point out at this point that including any of this information in the answer would not have helped the person asking the question :) – norlesh Jan 01 '14 at 11:04

Why is address calculation for array element lengths divisible by powers of 2 more efficient?

4 Answers4