Edit: See Adam's answer above for a version using SSE intrinsics. Better than what I had here ...
To make this more useful, let's look at the compiler-generated code. I'm using gcc 4.8.0, and yes, it is worth checking your specific compiler (and version), as there are quite significant differences in output between, say, gcc 4.4, gcc 4.8, clang 3.2, or Intel's icc.
Your original, compiled with g++ -O8 -msse4.2, translates into the following loop:
.L2:
cvtsi2sd (%rcx,%rax,4), %xmm0
mulsd %xmm1, %xmm0
addl $1, %edx
movsd %xmm0, (%rsi,%rax,8)
movslq %edx, %rax
cmpq %rdi, %rax
jbe .L2
where %xmm1 holds 1.0/32768.0; the compiler has automatically turned the division into a multiplication by the reciprocal.
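For reference, here is a minimal sketch of the kind of scalar loop under discussion (reconstructed from context; the names Convert, uIntegers and uDoubles are taken from the question, and the function assumes uDoubles has already been sized to match):

```cpp
#include <vector>
#include <cstddef>

// Scalar conversion loop: divides each integer sample by 32768.0.
// The compiler rewrites the division as a multiply by 1.0/32768.0,
// since the divisor is a compile-time constant power of two.
void Convert(const std::vector<int>& uIntegers, std::vector<double>& uDoubles)
{
    for (std::size_t i = 0; i < uIntegers.size(); i++)
        uDoubles[i] = uIntegers[i] / 32768.0;
}
```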
On the other hand, using g++ -msse4.2 -O8 -funroll-loops, the code created for the loop changes significantly:
[ ... ]
leaq -1(%rax), %rdi
movq %rdi, %r8
andl $7, %r8d
je .L3
[ ... insert a duff's device here, up to 6 * 2 conversions ... ]
jmp .L3
.p2align 4,,10
.p2align 3
.L39:
leaq 2(%rsi), %r11
cvtsi2sd (%rdx,%r10,4), %xmm9
mulsd %xmm0, %xmm9
leaq 5(%rsi), %r9
leaq 3(%rsi), %rax
leaq 4(%rsi), %r8
cvtsi2sd (%rdx,%r11,4), %xmm10
mulsd %xmm0, %xmm10
cvtsi2sd (%rdx,%rax,4), %xmm11
cvtsi2sd (%rdx,%r8,4), %xmm12
cvtsi2sd (%rdx,%r9,4), %xmm13
movsd %xmm9, (%rcx,%r10,8)
leaq 6(%rsi), %r10
mulsd %xmm0, %xmm11
mulsd %xmm0, %xmm12
movsd %xmm10, (%rcx,%r11,8)
leaq 7(%rsi), %r11
mulsd %xmm0, %xmm13
cvtsi2sd (%rdx,%r10,4), %xmm14
mulsd %xmm0, %xmm14
cvtsi2sd (%rdx,%r11,4), %xmm15
mulsd %xmm0, %xmm15
movsd %xmm11, (%rcx,%rax,8)
movsd %xmm12, (%rcx,%r8,8)
movsd %xmm13, (%rcx,%r9,8)
leaq 8(%rsi), %r9
movsd %xmm14, (%rcx,%r10,8)
movsd %xmm15, (%rcx,%r11,8)
movq %r9, %rsi
.L3:
cvtsi2sd (%rdx,%r9,4), %xmm8
mulsd %xmm0, %xmm8
leaq 1(%rsi), %r10
cmpq %rdi, %r10
movsd %xmm8, (%rcx,%r9,8)
jbe .L39
[ ... out ... ]
So it blocks the operations up, but still converts one value at a time.
If you change your original loop to operate on a few elements per iteration:
size_t i;
for (i = 0; i + 4 <= uIntegers.size(); i += 4)  // i + 4 <= size() avoids unsigned underflow for tiny vectors
{
    uDoubles[i]   = uIntegers[i]   / 32768.0;
    uDoubles[i+1] = uIntegers[i+1] / 32768.0;
    uDoubles[i+2] = uIntegers[i+2] / 32768.0;
    uDoubles[i+3] = uIntegers[i+3] / 32768.0;
}
for (; i < uIntegers.size(); i++)
    uDoubles[i] = uIntegers[i] / 32768.0;
the compiler, with gcc -msse4.2 -O8 (i.e. even without requesting unrolling), identifies the potential to use CVTDQ2PD/MULPD, and the core of the loop becomes:
.p2align 4,,10
.p2align 3
.L4:
movdqu (%rcx), %xmm0
addq $16, %rcx
cvtdq2pd %xmm0, %xmm1
pshufd $238, %xmm0, %xmm0
mulpd %xmm2, %xmm1
cvtdq2pd %xmm0, %xmm0
mulpd %xmm2, %xmm0
movlpd %xmm1, (%rdx,%rax,8)
movhpd %xmm1, 8(%rdx,%rax,8)
movlpd %xmm0, 16(%rdx,%rax,8)
movhpd %xmm0, 24(%rdx,%rax,8)
addq $4, %rax
cmpq %r8, %rax
jb .L4
cmpq %rdi, %rax
jae .L29
[ ... duff's device style for the "tail" ... ]
.L29:
rep ret
I.e. now the compiler recognizes the opportunity to put two doubles into each SSE register and perform the conversion and multiplication in parallel. This is pretty close to the code that Adam's SSE intrinsics version would generate.
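For comparison, a sketch of what such an SSE intrinsics version might look like (this is my own reconstruction of the pattern, not Adam's actual code; ConvertSSE is a name I made up):

```cpp
#include <emmintrin.h>   // SSE2: cvtdq2pd / mulpd intrinsics
#include <vector>
#include <cstddef>

// Convert four ints per iteration using the same cvtdq2pd + mulpd
// pattern the compiler generated above, then handle leftovers scalar.
void ConvertSSE(const std::vector<int>& uIntegers, std::vector<double>& uDoubles)
{
    const __m128d scale = _mm_set1_pd(1.0 / 32768.0);
    std::size_t i = 0;
    for (; i + 4 <= uIntegers.size(); i += 4)
    {
        __m128i v  = _mm_loadu_si128(
                         reinterpret_cast<const __m128i*>(&uIntegers[i]));
        __m128d lo = _mm_cvtepi32_pd(v);                        // ints 0,1 -> doubles
        __m128d hi = _mm_cvtepi32_pd(_mm_unpackhi_epi64(v, v)); // ints 2,3 -> doubles
        _mm_storeu_pd(&uDoubles[i],     _mm_mul_pd(lo, scale));
        _mm_storeu_pd(&uDoubles[i + 2], _mm_mul_pd(hi, scale));
    }
    for (; i < uIntegers.size(); i++)                           // scalar tail
        uDoubles[i] = uIntegers[i] / 32768.0;
}
```

Note that this uses unaligned loads/stores throughout, which sidesteps the alignment "head" the compiler generates at the cost of potentially slower memory accesses on older hardware.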
The code in total (I've shown only about 1/6th of it) is much more complex than the "direct" intrinsics, because, as mentioned, the compiler prepends/appends "heads" and "tails" to the loop to handle unaligned starts and lengths that aren't a multiple of the block size. Whether this pays off depends largely on the average/expected sizes of your vectors; for the "generic" case (vectors more than twice the size of the block processed by the innermost loop), it'll help.
The result of this exercise is, largely, that if you coerce (via compiler options/optimizations) or hint (by slightly rearranging the code) your compiler into doing the right thing, then for this specific kind of copy/convert loop it comes up with code that's not going to be far behind hand-written intrinsics.
Final experiment ... make the code:
static double c(int x) { return x / 32768.0; }

void Convert(const std::vector<int>& uIntegers, std::vector<double>& uDoubles)
{
    std::transform(uIntegers.begin(), uIntegers.end(), uDoubles.begin(), c);
}
and (for the nicest-to-read assembly output, this time using gcc 4.4 with gcc -O8 -msse4.2) the generated assembly core loop (again, there's a pre/post bit) becomes:
.p2align 4,,10
.p2align 3
.L8:
movdqu (%r9,%rax), %xmm0
addq $1, %rcx
cvtdq2pd %xmm0, %xmm1
pshufd $238, %xmm0, %xmm0
mulpd %xmm2, %xmm1
cvtdq2pd %xmm0, %xmm0
mulpd %xmm2, %xmm0
movapd %xmm1, (%rsi,%rax,2)
movapd %xmm0, 16(%rsi,%rax,2)
addq $16, %rax
cmpq %rcx, %rdi
ja .L8
cmpq %rbx, %rbp
leaq (%r11,%rbx,4), %r11
leaq (%rdx,%rbx,8), %rdx
je .L10
[ ... ]
.L10:
[ ... ]
ret
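On a C++11 compiler, the same std::transform version can be written without the free function, using a lambda (a sketch; ConvertLambda is a name I made up, and it behaves identically for this conversion):

```cpp
#include <algorithm>
#include <vector>

// std::transform with an inline lambda instead of a separate
// static function; the compiler can vectorize this just the same.
void ConvertLambda(const std::vector<int>& uIntegers, std::vector<double>& uDoubles)
{
    std::transform(uIntegers.begin(), uIntegers.end(), uDoubles.begin(),
                   [](int x) { return x / 32768.0; });
}
```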
With that, what do we learn? If you want to use C++, really use C++ ;-)