8

I realize that the answer is probably hardware-specific, but I'm curious whether there's a more general intuition that I'm missing.

I asked this question and, given the answer, I'm now wondering whether I should change my general approach and use "(i << 1 | 1)" instead of "(2*i + 1)"?

Community
M. Tibbits
  • I don't know for sure, but it probably works out to the same machine instructions... so I'd say go for whichever one is more readable. – Jon Seigel Dec 07 '10 at 04:47
  • @Jon Seigel: And "readable" means that which more clearly expresses the intent of the code. Are you (the OP) multiplying by two and adding one, or are you shifting left and setting the LSB? – jason Dec 07 '10 at 04:55
  • You are trying to do a job that the compiler would do. So u better not.^^ – pinichi Dec 07 '10 at 04:57
  • I find the first version faster to read. The second version takes a bit of thought about what you are trying to achieve. As a result I would always use the first one `As it is the fastest to understand`. – Martin York Dec 07 '10 at 05:54
  • http://stackoverflow.com/questions/183201/should-a-developer-aim-for-readability-or-performance-first – JohnMcG Dec 07 '10 at 17:20

8 Answers

13

Since the ISO standard doesn't actually mandate performance requirements, this will depend on the implementation, the compiler flags chosen, the target CPU and quite possibly the phase of the moon.

These sorts of optimisations (saving a couple of cycles) almost always pale into insignificance in terms of return on investment, next to macro-level optimisations like algorithm selection.

Aim for readability of code first and foremost. If your intent is to shift bits and OR, use the bit-shift version. If your intent is to multiply, use the * version. Only worry about performance once you've established there's an issue.

Any decent compiler will optimise it far better than you can anyway :-)

paxdiablo
  • Hopefully the compiler is not dependent on moon phase, though now that I think about it, I have worked with a few that do seem to depend on tidal characteristics? – Martin York Dec 07 '10 at 06:16
  • as in they get flooded at high tide? I might recommend moving the server to a higher altitude... ;) – jalf Dec 07 '10 at 14:36
  • I've been pretty disappointed by compilers failing to use bit shifts/adds to optimize multiplication. – Brian Knoblauch Dec 07 '10 at 15:58
  • @Knoblauch have you profiled the performance? Maybe using multiply allows the CPU microcode to use SIMD/SSE2 instructions to do it faster than a bit shift? – Martin Beckett Dec 07 '10 at 17:05
  • Not to mention preceding instructions. Many processors can execute multiple operations in parallel, but not multiple of the same type. Therefore, it makes sense to use a real multiply if the previous operation was a bit shift. You can even get the counterintuitive result that `a *= 2; b *= 2` uses two different operations, _precisely_ because they're different! – MSalters Dec 08 '10 at 13:19
8

Just an experiment regarding answers given about "... it'll use LEA":
The following code:

int main(int argc, char **argv)
{
#ifdef USE_SHIFTOR
return (argc << 1 | 1);
#else
return (2 * argc + 1);
#endif
}

will, with gcc -fomit-frame-pointer -O8 -m{32|64} (for 32bit or 64bit) compile into the following assembly code:

  1. x86, 32bit:
    080483a0 <main>:
    80483a0:    8b 44 24 04             mov    0x4(%esp),%eax
    80483a4:    8d 44 00 01             lea    0x1(%eax,%eax,1),%eax
    80483a8:    c3                      ret
  2. x86, 64bit:
    00000000004004c0 <main>:
    4004c0: 8d 44 3f 01             lea    0x1(%rdi,%rdi,1),%eax
    4004c4: c3                      retq
  3. x86, 32bit, -DUSE_SHIFTOR:
    080483a0 <main>:
    80483a0:    8b 44 24 04             mov    0x4(%esp),%eax
    80483a4:    01 c0                   add    %eax,%eax
    80483a6:    83 c8 01                or     $0x1,%eax
    80483a9:    c3                      ret
  4. x86, 64bit, -DUSE_SHIFTOR:
    00000000004004c0 <main>:
    4004c0: 8d 04 3f                lea    (%rdi,%rdi,1),%eax
    4004c3: 83 c8 01                or     $0x1,%eax
    4004c6: c3                      retq

In fact, it's true that most cases will use LEA. Yet the code is not the same for the two cases. There are two reasons for that:

  1. addition can overflow and wrap around, while bit operations like << or | cannot
  2. (x + 1) == (x | 1) only is true if !(x & 1) else the addition carries over to the next bit. In general, adding one only results in having the lowest bit set in half of the cases.

While we (and probably the compiler) can see that point 2 is not an issue here -- argc + argc is always even, so its lowest bit is clear -- point 1 still has to be considered. The compiler therefore creates different code for the two versions, since the or-version requires forcing bit zero to 1.

FrankH.
  • gcc (Ubuntu/Linaro 4.4.4-14ubuntu5) 4.4.5 – FrankH. Dec 14 '10 at 09:25
  • Nice to see someone actually puts speculations and wild assumptions to the test. But your explanation of why gcc doesn't optimize the shift version is wrong: your point 1 is invalid, as x<<1 wraps in exactly the same way as x+x for every x. Also, a recent enough compiler will optimize the shift version to the very same lea instruction. – Gunther Piez Aug 19 '12 at 10:31
  • @drhirsch: stand corrected ;-) you're right, I've done the test; gcc 4.7.2 creates the same code for 32/64bit no matter the exact formulation of the source. – FrankH. Aug 20 '12 at 08:58
5

Any but the most brain-dead compiler will see those expressions as equivalent and compile them to the same executable code.

Typically it's not really worth worrying too much about optimizing simple arithmetic expressions like these, since it's the sort of thing compilers are best at optimizing. (Unlike many other cases in which a "smart compiler" could do the right thing, but an actual compiler falls flat.)

This will work out to the same pair of instructions on PPC, Sparc, and MIPS, by the way: a shift followed by an add. On the ARM it'll cook down to a single fused shift-add instruction, and on x86 it'll probably be a single LEA op.

Crashworks
4

Output of gcc with the -S option (no compiler flags given):

.LCFI3:
        movl    8(%ebp), %eax
        addl    %eax, %eax
        orl     $1, %eax
        popl    %ebp
        ret

.LCFI1:
        movl    8(%ebp), %eax
        addl    %eax, %eax
        addl    $1, %eax
        popl    %ebp
        ret

I'm not sure which one is which, but I don't believe it matters.

If the compiler does no optimizations at all, the second would probably translate to faster assembly on many architectures, though how long each instruction takes is completely architecture-dependent. In practice, most compilers will optimize both to the same assembly-level instructions.

Jonathan Sternberg
  • Actually, you _can't_ say that, in general, the second will be faster, since it's quite possible to have an architecture where adds are ten times the speed of shifts (unlikely, but my point is that it's platform-dependent). If you're limiting yourself to a specific platform, that may be the case, but you should probably make that clear in the answer. – paxdiablo Dec 07 '10 at 07:27
  • And remember the proverb: benchmarking without -O3 is like comparing F1 drivers by how fast they can go on skateboards. – Kos Dec 07 '10 at 11:20
1

I just tested this with gcc-4.7.1 using FrankH.'s source; the generated code is

lea    0x1(%rdi,%rdi,1),%eax
retq

no matter if the shift or the multiplication version is used.

Gunther Piez
0

Nobody cares. Nor should they.
Quit worrying about that and get your code correct, simple, and done.

Stephen Hazel
  • Can we be less negative, or at least support your statement by saying "the compiler will treat the two forms equivalently"? – Seth Johnson Dec 12 '10 at 00:58
  • ok, ok, sorry. how about "you should probably be writing hand-crafted assembler if you care about speed in this detail"? no? In general when writing cpp I strive for correctness, simplicity and DONE. If optimized doesn't follow from simplicity then you're just begging the next poor slob to pick up this code to hunt you down and shoot ya... – Stephen Hazel Dec 15 '10 at 20:30
0

i + i + 1 may be faster than the other two, because addition is faster than multiplication and can be faster than a shift.

Abyx
  • This answer is not helpful because it's a baseless guess, with not even a hint of profiling or disassembly to back it up. It encourages people to "micro-optimize", which, as other answers have stated, is wrong. – Seth Johnson Dec 12 '10 at 00:15
-2

The first form (the one with the shift) is faster: the shift instruction takes 4 clock cycles to complete in the worst case, while mul takes 10 in the best case. However, the best form should be decided by the compiler, because it has a complete view of the other (assembly) instructions.

BlackBear