
If I write

int main(int argc, char *argv[])
{
    int temp[50][3];
    return &temp[argc] - &temp[0];
}

and compile it with Visual C++, I get back:

009360D0 55                   push        ebp  
009360D1 8B EC                mov         ebp,esp  
009360D3 8B 45 08             mov         eax,dword ptr [argc]  
009360D6 8D 0C 40             lea         ecx,[eax+eax*2]  
009360D9 B8 AB AA AA 2A       mov         eax,2AAAAAABh  
009360DE C1 E1 02             shl         ecx,2  
009360E1 F7 E9                imul        ecx  
009360E3 D1 FA                sar         edx,1  
009360E5 8B C2                mov         eax,edx  
009360E7 C1 E8 1F             shr         eax,1Fh  
009360EA 03 C2                add         eax,edx  
009360EC 5D                   pop         ebp  
009360ED C3                   ret  

Why am I getting an imul instruction here instead of just bit shifts, etc.? I find this pretty annoying, because I'm doing pointer arithmetic like this in a tight loop, and I'm suspecting imul is killing its performance. In any case, it shouldn't be necessary.

Is there a good way to prevent it in general and instead replace it with cheaper operations?

Update:

In my original program, I tried adding a dummy variable to each element so that it held 4 ints instead of 3, making its size a power of two so the compiler could use bit shifts instead of division.

The result? Even though the data structure was bigger, the running time of the program decreased from 9.2 seconds to 7.4 seconds.

So yes, this is indeed very slow.
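
A minimal sketch of that padding workaround, under my reading of the update (the struct and member names are illustrative, not from the original program):

#include <cstddef>

// Padding the element from 3 ints to 4 makes sizeof(Element) == 16, a
// power of two, so the division hidden in pointer subtraction can become
// an arithmetic shift instead of a multiply by a fixed-point reciprocal.
struct Element {
    int values[3];
    int padding_;   // dummy member, never read
};

std::ptrdiff_t index_of(const Element *first, const Element *p) {
    return p - first;   // byte difference shifted right by 4, no imul needed
}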

user541686
  • Have you tried replacing the mul instruction with a bit shift and benchmarking the result? – Fred Foo Feb 13 '14 at 11:43
  • @larsmans: No, because I'm using vector iterators (which are pointers underneath) and therefore I can't necessarily dereference them to get back raw pointers. If I did do that, I could accidentally dereference the past-the-end iterator, causing undefined behavior. (Checking for a past-the-end iterator itself would add excess instructions.) – user541686 Feb 13 '14 at 11:45
  • If you're using vectors, why not operate on indices then, if what you're doing is more or less calculating an index (or rather, the byte offset from the beginning of the vector to an element at some index) there? – Damon Feb 13 '14 at 11:47
  • @Damon: Because the algorithm is meant to be generic and the input isn't guaranteed to be a vector iterator. It just happens to be in this particular case, and my goal here is not to provide separate code for vector iterators (that's missing the point). – user541686 Feb 13 '14 at 11:48
  • Hm, what's the optimization level? [gcc at -O2](http://gcc.godbolt.org/#{%22version%22%3A3%2C%22filterAsm%22%3A{%22labels%22%3Atrue%2C%22directives%22%3Atrue%2C%22commentOnly%22%3Atrue}%2C%22compilers%22%3A[{%22source%22%3A%22int%20main%28int%20argc%2C%20char%20*argv[]%29\n{\n%20%20%20%20int%20temp[50][3]%3B\n%20%20%20%20return%20%26temp[argc]%20-%20%26temp[0]%3B\n}\n%22%2C%22compiler%22%3A%22%2Fusr%2Fbin%2Fg%2B%2B-4.8%22%2C%22options%22%3A%22-O2%20-std%3Dc%2B%2B11%22}]}) – jrok Feb 13 '14 at 11:49
  • But then, if the input isn't guaranteed to be a vector iterator, the assumption "will be pointers" is rather bold. – Damon Feb 13 '14 at 11:51
  • "I'm suspecting imul is killing its performance" ... you better check this. Your compiler is almost always right about this stuff. – tenfour Feb 13 '14 at 12:03
  • @Damon: I don't get what you're trying to say. The algorithm doesn't care what the iterator type is, but in my particular project I'm using vector iterators. – user541686 Feb 13 '14 at 21:27
  • I was just wondering that you said you're using vector iterators which are pointers underneath. But then you said it isn't necessarily vector iterators. Which means that the iterators could be something quite different from pointers (so doing pointer arith would be kind of UB anyway). But I guess it doesn't matter for the question about removing the IMUL. – Damon Feb 13 '14 at 21:30
  • @tenfour: I checked it and, in fact, I was right. See my update. – user541686 Feb 13 '14 at 21:41
  • This specific test case is just a missed-optimization by MSVC. As @jrok points out, gcc correctly optimizes it to `return argc`. Actually multiplying by 3 (with `lea`) and then [dividing with a fixed-point multiplicative inverse](https://stackoverflow.com/questions/41183935/why-does-gcc-use-multiplication-by-a-strange-number-in-implementing-integer-divi) will give a different result only in cases of UB: when `&temp[argc]` is out-of-bounds. The maximum supported object size in gcc is half the address space (max signed `ptrdiff_t`, not max `size_t`), so this is valid there, and prob. on MSVC. – Peter Cordes Aug 30 '18 at 23:59

1 Answer


Why am I getting an imul instruction here instead of just bit shifts, etc.?

The multiplication is to divide by 3 (the size of each inner array), using the fact that 0x2AAAAAAB is 2³¹/3, rounded up. You can't compute that division with a small number of shifts and adds; multiplication really is the fastest option.
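
For illustration, here is a sketch (my reconstruction, not anything from the compiler) of what the generated sequence computes: the lea/shl pair scales argc by 12 (the byte size of one int[3]), and the imul/sar/shr/add sequence divides that byte offset by 12 again, since 0x2AAAAAAB combined with a total shift of 33 acts as a fixed-point reciprocal of 12:

#include <cassert>
#include <cstdint>

// Step-by-step reconstruction of the disassembly in the question.
int32_t pointer_difference(int32_t argc) {
    int32_t offset = argc * 12;                        // lea ecx,[eax+eax*2]; shl ecx,2
    int64_t product = (int64_t)offset * 0x2AAAAAABLL;  // imul ecx: 64-bit product in edx:eax
    int32_t t = (int32_t)(product >> 33);              // take edx (high half), then sar edx,1
                                                       // (arithmetic shift assumed)
    return t + ((uint32_t)t >> 31);                    // shr eax,1Fh; add eax,edx
                                                       // (rounds negatives toward zero)
}

int main() {
    for (int32_t i = -50; i <= 50; ++i)
        assert(pointer_difference(i) == i);            // the divide undoes the multiply
}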

I'm suspecting imul is killing its performance.

On most modern platforms, integer multiplication is usually as fast as simpler operations, so it may be the fastest option even when it could be replaced by a couple of shifts and adds. When you have performance issues, always measure to find the real bottlenecks; they're often in the places you least suspect.

Is there a good way to prevent it in general and instead replace it with cheaper operations?

On platforms where multiplication really is expensive, avoid arrays of awkwardly sized data structures, and avoid subtracting pointers, since that requires a division by the object size.
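
For instance, when the quotient you actually want is an element index, you can often maintain it incrementally instead of recovering it with a pointer subtraction in the hot loop; a minimal sketch, with illustrative names:

#include <cstddef>

// Carry an explicit index, updated with ++ (an add), rather than computing
// it as "current - first", which costs a division by the element size
// whenever that size is not a power of two.
template <typename Iterator, typename Func>
void for_each_indexed(Iterator first, Iterator last, Func f) {
    std::size_t index = 0;
    for (; first != last; ++first, ++index)
        f(index, *first);   // no pointer subtraction inside the loop
}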

Mike Seymour
  • Just wanted to mention that this really does kill performance even on x86 (see my update). So your last paragraph was right on the mark, thanks. – user541686 Feb 13 '14 at 21:43
  • Integer multiplication is fast (`imul eax, ecx` is 3c latency / 1c throughput on Intel since Nehalem, and on AMD since Ryzen), but integer `add` is even faster: 1c latency / 0.25c throughput (Intel since Haswell). For code that can auto-vectorize, SIMD integer add is also much cheaper than SIMD integer multiply, like 10c vs. 1c latency. (And a similar throughput difference to scalar integer). See https://agner.org/optimize/. But yes, `imul` is better than multiple shifts/adds, and is a single uop (or a couple uops for the version that produces a high-half result like here). – Peter Cordes Aug 30 '18 at 23:52
  • gcc and clang both use up to 2 LEA or other instructions to multiply by small constants or powers of 2, but use `imul` instead of 3 other instructions. – Peter Cordes Aug 30 '18 at 23:57