Short Answer: THAT DEPENDS.
Long Answer:
Yes, it is very possible, but only if you can exploit knowledge that the compiler cannot deduce automatically. In my experience this is quite rare; most compilers are pretty good at vectorizing nowadays. Much also depends on how you model your data and on how willing you are to write incredibly complex code. For most users, I wouldn't recommend going through the trouble in the first place.
To give you an example, here's the implementation of x / 10 where x is a signed integer (this is actually what the compiler will generate):
int64_t product = (int64_t)value * 0x66666667; // full 64-bit product
int edx = (int)(product >> 32) >> 2;           // high 32 bits of the product; NOTE: arithmetic shift here!
int result = edx + ((unsigned)edx >> 31);      // add the sign bit to round toward zero
If you disassemble your compiled C++ code and you used a constant for the '10', the assembly will reflect the sequence above. If the divisor is not a compile-time constant, the compiler generates an idiv instruction instead, which is much slower.
Knowing that your memory is aligned, and therefore that your code can be vectorized, can be very beneficial. Do note that this requires you to store your data in a way that makes vectorization possible.
For example, if you want to calculate the sum-of-div/10's of all integers, you can do something like this:
#include <immintrin.h> // AVX2 intrinsics
#include <cstdint>     // INT32_MIN, INT32_MAX
#include <iostream>

__m256i ctr = _mm256_set_epi32(0, 1, 2, 3, 4, 5, 6, 7);
ctr = _mm256_add_epi32(_mm256_set1_epi32(INT32_MIN), ctr);
__m256i sumdiv = _mm256_set1_epi32(0);
const __m256i magic = _mm256_set1_epi32(0x66666667);
const int shift = 2;
// Iterate over every 32-bit integer, 8 at a time, to show this is correct:
for (long long i = INT32_MIN; i <= INT32_MAX; i += 8)
{
    // Multiply the even and the odd lanes by the magic number;
    // the high 32 bits of each 64-bit product are what we're after
    __m256i ovf1 = _mm256_srli_epi64(_mm256_mul_epi32(ctr, magic), 32);
    __m256i ovf2 = _mm256_mul_epi32(_mm256_srli_epi64(ctr, 32), magic);
    // Blend the high halves back into one vector and apply the arithmetic shift
    __m256i quot = _mm256_srai_epi32(_mm256_blend_epi32(ovf1, ovf2, 0xAA), shift);
    // Add the sign bit to round toward zero, completing the division
    __m256i div = _mm256_add_epi32(quot, _mm256_srli_epi32(quot, 31));
    // Do something with the result; here: accumulate, then advance the counter
    sumdiv = _mm256_add_epi32(sumdiv, div);
    ctr = _mm256_add_epi32(ctr, _mm256_set1_epi32(8));
}
// Horizontal sum of the 8 lanes; store to an array for portability
// (sumdiv.m256i_i32[i] is MSVC-specific)
int lanes[8];
_mm256_storeu_si256(reinterpret_cast<__m256i*>(lanes), sumdiv);
int sum = 0;
for (int i = 0; i < 8; ++i) { sum += lanes[i]; }
std::cout << sum << std::endl;
If you benchmark these implementations on an Intel Haswell processor, you'll get results like these:
- idiv: 1.4 GB/s
- compiler-optimized multiply-and-shift: 4 GB/s
- AVX2 instructions: 16 GB/s
For other powers of 10 and for unsigned division, I recommend reading the paper.