
I would like to ask if there is a quicker way to do my audio conversion than iterating through all values one by one and dividing each by 32768.

void CAudioDataItem::Convert(const vector<int>& uIntegers, vector<double>& uDoubles)
{
    for (int i = 0; i <= uIntegers.size() - 1; i++)
    {
        uDoubles[i] = uIntegers[i] / 32768.0;
    }
}

My approach works fine, but it could be quicker. However, I did not find any way to speed it up.

Thank you for the help!

Mysticial
tmighty
  • You can try multiplying by `1 / 32768.0` instead. Not sure if the compiler is allowed to do that by itself since it breaks IEEE. – Mysticial Jun 24 '13 at 18:35
  • use cuda and do it all at once – aaronman Jun 24 '13 at 18:38
  • @aaronman cuda isn't worth it for this. The transformation is trivial. Moving the data would be a huge bottleneck. – harold Jun 24 '13 at 18:40
  • No, there is no trick in C++ for dividing large chunks of memory. You are stuck with iterating over the whole array. Make sure to compile with optimizations on; it will speed things up a lot, particularly if you are dealing with large vectors. – MobA11y Jun 24 '13 at 18:42
  • Do you need standard C++, or can you use implementation-specific behavior? What is your target platform? Many general-purpose processors these days have SIMD features that can perform two or four `double` operations in one instruction. Accessing them requires non-standard code. – Eric Postpischil Jun 24 '13 at 18:43
  • @Mysticial GCC optimizes this already at optimization level `-O1` – SirGuy Jun 24 '13 at 18:45
  • @Eric: this is good style, but optimization will do what you describe for you. – MobA11y Jun 24 '13 at 18:46
  • @GuyGreer: Only with -ffast-math – Sebastian Mach Jun 24 '13 at 18:46
  • @Mystical: it is best to let the optimizer take care of differences like multiplying or dividing by constants, and leave the code the way it is. – MobA11y Jun 24 '13 at 18:47
  • @GuyGreer Now I'm actually questioning whether this optimization breaks strict IEEE at all. – Mysticial Jun 24 '13 at 18:47
  • @ChrisCM Well, that's definitely the case for integers. But for FP, if the transformation breaks strict FP, then the compiler isn't allowed to do it without `-ffast-math`. It may be the case that dividing and multiplying by constants can be optimized without breaking strict-FP. I haven't convinced myself either way yet. – Mysticial Jun 24 '13 at 18:49
  • @Mysticial: I think it is okay. The mathematical results of x/32768 and x•(1/32768) are identical, so they will generate identical results with rounding, underflow, et cetera. – Eric Postpischil Jun 24 '13 at 18:50
  • I believe the key here is that in both multiply and divide, the second operand is a constant, and therefore should be okay. Might be very version dependent though, I agree. – MobA11y Jun 24 '13 at 18:51
  • Oh, and I'm +1'ing this question since whether or not this optimization is done is clearly not as clear-cut as we thought it would be, given this comment trail. – Mysticial Jun 24 '13 at 18:51
  • @ChrisCM Which optimizations do you mean? I have not used compiler optimization yet. – tmighty Jun 24 '13 at 18:53
  • @ChrisCM: The operand needs to have an exact reciprocal in floating-point values. Which implies it is a power of two. – Eric Postpischil Jun 24 '13 at 18:53
  • @Mysticial With a value that is not a power of 2, gcc uses division, not multiplication, so it is not a general thing – SirGuy Jun 24 '13 at 19:01
  • Can we take advantage of [return value optimization](http://en.wikipedia.org/wiki/Return_value_optimization) here? – andre Jun 24 '13 at 19:05
  • There’s a bug in your code: you are iterating one element too many. – Konrad Rudolph Jun 24 '13 at 19:40
  • With a tight loop like this you might find yourself constrained by the speed of memory access, which would make the instruction speed itself totally irrelevant. – Mark Ransom Jun 24 '13 at 19:42
  • @MarkRansom, which leads to the question: what is going to be done with the array of doubles? If they are being fed into a transform, do the conversion as the first step in the transform loop. – AShelly Jun 24 '13 at 19:53
  • @tmighty, if you want exactly the algorithm you have now, I don't find much improvement with GCC (since it already vectorizes the function). However, with MSVC you can improve it quite a bit. The best way I can think of to improve your results is to not do the conversion all at once but in L1-cache blocks between other operations. That way the CPU is not stalling due to the slower cache levels/main memory. See my answer. – Jun 25 '13 at 12:55
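The power-of-two argument made in this comment trail can be checked directly; a small hypothetical helper (not from the thread) that verifies division and multiplication by the exact reciprocal agree bit-for-bit:

```cpp
#include <cassert>

// 1/32768.0 is 2^-15, which is exactly representable as a double, so
// x / 32768.0 and x * (1.0 / 32768.0) round identically for every int x.
inline bool div_equals_mul(int x) {
    return x / 32768.0 == x * (1.0 / 32768.0);
}
```

For a non-power-of-two divisor such as 3 the reciprocal is not exact, which is why GCC keeps the division in that case.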

6 Answers


If your array is large enough it may be worthwhile to parallelize this for loop. OpenMP's parallel for statement is what I would use.
The function would then be:

    void CAudioDataItem::Convert(const vector<int>& uIntegers, vector<double>& uDoubles)
    {
        const int n = static_cast<int>(uIntegers.size()); // OpenMP 2.0 (MSVC) needs a signed loop index
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
        {
            uDoubles[i] = uIntegers[i] / 32768.0;
        }
    }

With gcc you need to pass -fopenmp when you compile for the pragma to take effect; on MSVC it is /openmp. Since spawning threads has a noticeable overhead, this will only be faster if you are processing large arrays, YMMV.
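To avoid paying the thread-spawn cost on small inputs, OpenMP's `if` clause can make the parallelism conditional. A sketch as a free function (the 4096 cutoff is a guess to be tuned, not a measured value):

```cpp
#include <vector>

void Convert(const std::vector<int>& uIntegers, std::vector<double>& uDoubles) {
    const int n = static_cast<int>(uIntegers.size());
    // Only go parallel when the array is large enough to amortize thread startup.
    #pragma omp parallel for if(n > 4096)
    for (int i = 0; i < n; i++)
        uDoubles[i] = uIntegers[i] / 32768.0;
}
```

Without -fopenmp the pragma is simply ignored and the loop runs serially, so the function stays correct either way.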

SirGuy

For maximum speed you want to convert more than one value per loop iteration. The easiest way to do that is with SIMD. Here's roughly how you'd do it with SSE2:

#include <emmintrin.h>  // SSE2 intrinsics

void CAudioDataItem::Convert(const vector<int>& uIntegers, vector<double>& uDoubles)
{
    __m128d scale = _mm_set_pd( 1.0 / 32768.0, 1.0 / 32768.0 );
    int i = 0;
    for ( ; i + 4 <= (int)uIntegers.size(); i += 4)
    {
        __m128i x = _mm_loadu_si128((const __m128i*)&uIntegers[i]);
        // Move elements 2 and 3 into the low lanes so they can be converted too
        __m128i y = _mm_shuffle_epi32(x, _MM_SHUFFLE(3,2,3,2) );
        __m128d dx = _mm_cvtepi32_pd(x);   // converts the low two ints
        __m128d dy = _mm_cvtepi32_pd(y);
        dx = _mm_mul_pd(dx, scale);
        dy = _mm_mul_pd(dy, scale);
        _mm_storeu_pd(&uDoubles[i], dx);
        _mm_storeu_pd(&uDoubles[i + 2], dy);
    }
    // Finish off the last 0-3 elements the slow way
    for ( ; i < (int)uIntegers.size(); i++)
    {
        uDoubles[i] = uIntegers[i] / 32768.0;
    }
}

We process four integers per loop iteration. As we can only fit two doubles in the registers there's some duplicated work, but the extra unrolling will help performance unless the arrays are tiny.

Changing the data types to smaller ones (say short and float) should also help performance, because they cut down on memory bandwidth, and you can fit four floats in an SSE register. For audio data you shouldn't need the precision of a double.

Note that I've used unaligned loads and stores. Aligned ones will be slightly quicker if the data is actually aligned (which it won't be by default, and it's hard to make stuff aligned inside a std::vector).

Adam
  • +1 Your SSE2 is better than mine - illustrating the blocked operation as well ;-) – FrankH. Jun 24 '13 at 22:09
  • Thanks. You might also want to look at how many doubles your code writes when the input vector is size 2 ;) – Adam Jun 24 '13 at 22:36
  • Noted ... I need an `if (i < uIntegers.size()) ...` test ;-) – FrankH. Jun 25 '13 at 08:03
  • For maximum speed you want to use aligned memory, AVX, and OpenMP. –  Jun 25 '13 at 12:25
  • Aligned memory is a lot quicker, not slightly quicker. –  Jun 25 '13 at 12:30
  • I have not tried to use std::vector aligned but here is a discussion on it http://stackoverflow.com/questions/12942548/making-stdvector-allocate-aligned-memory –  Jun 25 '13 at 12:30

Your function is highly parallelizable. On modern Intel CPUs there are three independent ways to parallelize it: instruction-level parallelism (ILP), thread-level parallelism (TLP), and SIMD. I was able to use all three to get big boosts in your function. The results are compiler dependent, though. The boost is much smaller with GCC since it already vectorizes the function. See the table of numbers below.

However, the main limiting factor in your function is that its time complexity is only O(n), so there is a drastic drop in efficiency when the size of the array you're running over crosses each cache-level boundary. If you look, for example, at large dense matrix multiplication (GEMM), it's an O(n^3) operation, so if one does things right (using e.g. loop tiling) the cache hierarchy is not a problem: you can get close to the maximum flops/s even for very large matrices (which seems to indicate that GEMM is one of the things Intel considers when designing a CPU). The way to fix this in your case is to find a way to do your function on an L1-cache block right after/before you do a more complex operation (for example one that goes as O(n^2)) and then move to another L1 block. Of course I don't know what you're doing, so I can't do that.
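The blocking idea could be sketched like this (the block size and the commented-out `process` step are placeholders, since the real follow-up work isn't known):

```cpp
#include <algorithm>

// Hypothetical sketch: convert one L1-sized chunk, then immediately run the
// expensive follow-up work on it while the data is still hot in cache.
void convert_blocked(const int* src, double* dst, int n) {
    const int BLOCK = 2048;  // ~24 KB working set (int + double), fits a 32 KB L1
    for (int b = 0; b < n; b += BLOCK) {
        const int end = std::min(n, b + BLOCK);
        for (int i = b; i < end; i++)
            dst[i] = src[i] / 32768.0;
        // process(dst + b, end - b);  // placeholder for the more complex stage
    }
}
```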

ILP is partially done for you by the CPU hardware. However, loop-carried dependencies often limit the ILP, so it often helps to do loop unrolling to take full advantage of it. For TLP I use OpenMP, and for SIMD I used AVX (the code below works for SSE as well). I used 32-byte aligned memory and made sure the array was a multiple of 8 so that no clean-up was necessary.

Here are the results from Visual Studio 2012 64-bit with AVX and OpenMP (release mode, obviously) on a Sandy Bridge-EP, 4 cores (8 HW threads) @ 3.6 GHz. The variable n is the number of items. I repeat the function several times as well, so the total time includes that. The function convert_vec4_unroll2_openmp gives the best results except in the L1 region. You can also clearly see that the efficiency drops significantly each time you move to a new cache level, but even for main memory it's still better.

l1 cache, n 2752, repeat 300000
    convert time 1.34, error 0.000000
    convert_vec4 time 0.16, error 0.000000
    convert_vec4_unroll2 time 0.16, error 0.000000
    convert_vec4_unroll2_openmp time 0.31, error 0.000000

l2 cache, n 21856, repeat 30000
    convert time 1.14, error 0.000000
    convert_vec4 time 0.24, error 0.000000
    convert_vec4_unroll2 time 0.24, error 0.000000
    convert_vec4_unroll2_openmp time 0.12, error 0.000000

l3 cache, n 699072, repeat 1000
    convert time 1.23, error 0.000000
    convert_vec4 time 0.44, error 0.000000
    convert_vec4_unroll2 time 0.45, error 0.000000
    convert_vec4_unroll2_openmp time 0.14, error 0.000000

main memory, n 8738144, repeat 100
    convert time 1.56, error 0.000000
    convert_vec4 time 0.95, error 0.000000
    convert_vec4_unroll2 time 0.89, error 0.000000
    convert_vec4_unroll2_openmp time 0.51, error 0.000000

Results with g++ foo.cpp -mavx -fopenmp -ffast-math -O3 on an i5-3317 (Ivy Bridge) @ 2.4 GHz, 2 cores (4 HW threads). GCC seems to vectorize this and the only benefit comes from OpenMP (which, however, gives a worse result in the L1 region).

l1 cache, n 2752, repeat 300000
    convert time 0.26, error 0.000000
    convert_vec4 time 0.22, error 0.000000
    convert_vec4_unroll2 time 0.21, error 0.000000
    convert_vec4_unroll2_openmp time 0.46, error 0.000000

l2 cache, n 21856, repeat 30000
    convert time 0.28, error 0.000000
    convert_vec4 time 0.27, error 0.000000
    convert_vec4_unroll2 time 0.27, error 0.000000
    convert_vec4_unroll2_openmp time 0.20, error 0.000000

l3 cache, n 699072, repeat 1000
    convert time 0.80, error 0.000000
    convert_vec4 time 0.80, error 0.000000
    convert_vec4_unroll2 time 0.80, error 0.000000
    convert_vec4_unroll2_openmp time 0.83, error 0.000000

main memory, n 8738144, repeat 100
    convert time 1.10, error 0.000000
    convert_vec4 time 1.10, error 0.000000
    convert_vec4_unroll2 time 1.10, error 0.000000
    convert_vec4_unroll2_openmp time 1.00, error 0.000000

Here is the code. I use the vectorclass http://www.agner.org/optimize/vectorclass.zip to do SIMD. This will use either AVX to write 4 doubles at once or SSE to write 2 doubles at once.

#include <stdlib.h>
#include <stdio.h>
#include <omp.h>
#include "vectorclass.h"

void convert(const int *uIntegers, double *uDoubles, const int n) {
    for ( int i = 0; i<n; i++) {
        uDoubles[i] = uIntegers[i] / 32768.0;
    }
}

void convert_vec4(const int *uIntegers, double *uDoubles, const int n) {
    Vec4d div = 1.0/32768;
    for ( int i = 0; i<n; i+=4) {
        Vec4i u4i = Vec4i().load(&uIntegers[i]);
        Vec4d u4d  = to_double(u4i);
        u4d*=div;
        u4d.store(&uDoubles[i]);
    }
}

void convert_vec4_unroll2(const int *uIntegers, double *uDoubles, const int n) {
    Vec4d div = 1.0/32768;
    for ( int i = 0; i<n; i+=8) {
        Vec4i u4i_v1 = Vec4i().load(&uIntegers[i]);
        Vec4d u4d_v1  = to_double(u4i_v1);
        u4d_v1*=div;
        u4d_v1.store(&uDoubles[i]);

        Vec4i u4i_v2 = Vec4i().load(&uIntegers[i+4]);
        Vec4d u4d_v2  = to_double(u4i_v2);
        u4d_v2*=div;
        u4d_v2.store(&uDoubles[i+4]);
    }
}

void convert_vec4_openmp(const int *uIntegers, double *uDoubles, const int n) {
    #pragma omp parallel for    
    for ( int i = 0; i<n; i+=4) {
        Vec4i u4i = Vec4i().load(&uIntegers[i]);
        Vec4d u4d  = to_double(u4i);
        u4d/=32768.0;
        u4d.store(&uDoubles[i]);
    }
}

void convert_vec4_unroll2_openmp(const int *uIntegers, double *uDoubles, const int n) {
    Vec4d div = 1.0/32768;
    #pragma omp parallel for    
    for ( int i = 0; i<n; i+=8) {
        Vec4i u4i_v1 = Vec4i().load(&uIntegers[i]);
        Vec4d u4d_v1  = to_double(u4i_v1);
        u4d_v1*=div;
        u4d_v1.store(&uDoubles[i]);

        Vec4i u4i_v2 = Vec4i().load(&uIntegers[i+4]);
        Vec4d u4d_v2  = to_double(u4i_v2);
        u4d_v2*=div;
        u4d_v2.store(&uDoubles[i+4]);
    }
}

double compare(double *a, double *b, const int n) {
    double diff = 0.0;
    for(int i=0; i<n; i++) {
        double tmp = a[i] - b[i];
        //printf("%d %f %f \n", i, a[i], b[i]);
        if(tmp<0) tmp*=-1;
        diff += tmp;
    }
    return diff;
}

void loop(const int n, const int repeat, const int ifunc) {
    void (*fp[4])(const int *uIntegers, double *uDoubles, const int n);

    int *a = (int*)_mm_malloc(sizeof(int)* n, 32);
    double *b1_cmp = (double*)_mm_malloc(sizeof(double)*n, 32);
    double *b1 = (double*)_mm_malloc(sizeof(double)*n, 32);
    double dtime;

    const char *fp_str[] = {
        "covert",
        "convert_vec4",
        "convert_vec4_unroll2",
        "convert_vec4_unroll2_openmp",
    };

    for(int i=0; i<n; i++) {
        a[i] = rand(); // rand()*RAND_MAX would overflow int (undefined behavior)
    }

    fp[0] = convert;
    fp[1] = convert_vec4;
    fp[2] = convert_vec4_unroll2;
    fp[3] = convert_vec4_unroll2_openmp;

    convert(a, b1_cmp, n);

    dtime = omp_get_wtime();
    for(int i=0; i<repeat; i++) {
        fp[ifunc](a, b1, n);
    }
    dtime = omp_get_wtime() - dtime;
    printf("\t%s time %.2f, error %f\n", fp_str[ifunc], dtime, compare(b1_cmp,b1,n));

    _mm_free(a);
    _mm_free(b1_cmp);
    _mm_free(b1);
}

int main() {
    int l1 = (32*1024)/(sizeof(int) + sizeof(double));
    int l2 = (256*1024)/(sizeof(int) + sizeof(double));
    int l3 = (8*1024*1024)/(sizeof(int) + sizeof(double));
    int lx = (100*1024*1024)/(sizeof(int) + sizeof(double));
    int n[] = {l1, l2, l3, lx};

    int repeat[] = {300000, 30000, 1000, 100};
    const char *cache_str[] = {"l1", "l2", "l3", "main memory"};
    for(int c=0; c<4; c++ ) {
        int lda = ((n[c]+7) & -8); //make sure array is a multiple of 8
        printf("%s chache, n %d\n", cache_str[c], lda);
        for(int i=0; i<4; i++) {
            loop(lda, repeat[c], i);
        } printf("\n");
    }
}

Lastly, for anyone who has read this far and feels like reminding me that my code looks more like C than C++: please read this first before you decide to comment: http://www.stroustrup.com/sibling_rivalry.pdf


You might also try:

uDoubles[i] = ldexp((double)uIntegers[i], -15);
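As a self-contained function, that suggestion might look like this (ldexp scales by an exact power of two, so the result matches the division bit-for-bit):

```cpp
#include <cmath>
#include <vector>

void ConvertLdexp(const std::vector<int>& in, std::vector<double>& out) {
    for (size_t i = 0; i < in.size(); ++i)
        out[i] = std::ldexp(static_cast<double>(in[i]), -15);  // x * 2^-15 == x / 32768
}
```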
Lee Daniel Crocker
  • Yeah, probably function call overhead; but a really good compiler might inline it. – Lee Daniel Crocker Jun 24 '13 at 18:56
  • I'd be interested to see how the compiler handles this. It might actually be smart enough to just emit a multiply by `1 / 32768.0`. (Not that I'd count on it though.) – Mysticial Jun 24 '13 at 18:59
  • Inline it to what? The existing code is likely a convert-to-double and a floating-point multiply. You are not likely to get faster without either a processor instruction to convert-and-scale simultaneously (which some processors have, and which the compiler could optimize the existing code to) or SIMD. – Eric Postpischil Jun 24 '13 at 18:59
  • GCC -O3 -ffast-math reduces it to a single x87 `fscale` instruction. Don't know whether or not that would be faster than the `fmul`. – Lee Daniel Crocker Jun 24 '13 at 19:11
  • On Core2, `fscale` has a latency of 43, `fmul` has 5. On Sandy Bridge, `fscale` has a latency of 12, `fmul` still has 5. – harold Jun 24 '13 at 19:23
  • Wow. Multiply it is, then. – Lee Daniel Crocker Jun 24 '13 at 19:27
  • Huh. I always thought an `frexp` or `ldexp` would usually be faster than multiplying by a power of two. Is that not the case even when there's no conversion involved? (The one reliable rule of optimization: your educated guesses are often wrong.) – aschepler Jun 24 '13 at 19:53
  • @aschepler adding directly to the exponent bits can be pretty fast if done with integer arithmetic (so with SSE or with regular integers, can't be done in x87 registers), but that disregards edge cases and the semantics of `ldexp` don't allow that. – harold Jun 24 '13 at 20:37
  • `frexp` and `ldexp` usually turn into function calls because they need to handle things like zero and subnormals correctly. They do more than just bit-fiddling. – tmyklebu Jun 25 '13 at 01:32

Edit: See Adam's answer above for a version using SSE intrinsics. Better than what I had here ...

To make this more useful, let's look at compiler-generated code here. I'm using gcc 4.8.0 and yes, it is worth checking your specific compiler (version) as there are quite significant differences in output for, say, gcc 4.4, 4.8, clang 3.2 or Intel's icc.

Your original, using g++ -O8 -msse4.2 ... translates into the following loop:

.L2:
    cvtsi2sd        (%rcx,%rax,4), %xmm0
    mulsd   %xmm1, %xmm0
    addl    $1, %edx
    movsd   %xmm0, (%rsi,%rax,8)
    movslq  %edx, %rax
    cmpq    %rdi, %rax
    jbe     .L2

where %xmm1 holds 1.0/32768.0, so the compiler automatically turns the division into a multiplication by the reciprocal.

On the other hand, using g++ -msse4.2 -O8 -funroll-loops ..., the code created for the loop changes significantly:

[ ... ]
    leaq    -1(%rax), %rdi
    movq    %rdi, %r8
    andl    $7, %r8d
    je      .L3
[ ... insert a duff's device here, up to 6 * 2 conversions ... ]
    jmp     .L3
    .p2align 4,,10
    .p2align 3
.L39:
    leaq    2(%rsi), %r11
    cvtsi2sd        (%rdx,%r10,4), %xmm9
    mulsd   %xmm0, %xmm9
    leaq    5(%rsi), %r9
    leaq    3(%rsi), %rax
    leaq    4(%rsi), %r8
    cvtsi2sd        (%rdx,%r11,4), %xmm10
    mulsd   %xmm0, %xmm10
    cvtsi2sd        (%rdx,%rax,4), %xmm11
    cvtsi2sd        (%rdx,%r8,4), %xmm12
    cvtsi2sd        (%rdx,%r9,4), %xmm13
    movsd   %xmm9, (%rcx,%r10,8)
    leaq    6(%rsi), %r10
    mulsd   %xmm0, %xmm11
    mulsd   %xmm0, %xmm12
    movsd   %xmm10, (%rcx,%r11,8)
    leaq    7(%rsi), %r11
    mulsd   %xmm0, %xmm13
    cvtsi2sd        (%rdx,%r10,4), %xmm14
    mulsd   %xmm0, %xmm14
    cvtsi2sd        (%rdx,%r11,4), %xmm15
    mulsd   %xmm0, %xmm15
    movsd   %xmm11, (%rcx,%rax,8)
    movsd   %xmm12, (%rcx,%r8,8)
    movsd   %xmm13, (%rcx,%r9,8)
    leaq    8(%rsi), %r9
    movsd   %xmm14, (%rcx,%r10,8)
    movsd   %xmm15, (%rcx,%r11,8)
    movq    %r9, %rsi
.L3:
    cvtsi2sd        (%rdx,%r9,4), %xmm8
    mulsd   %xmm0, %xmm8
    leaq    1(%rsi), %r10
    cmpq    %rdi, %r10
    movsd   %xmm8, (%rcx,%r9,8)
    jbe     .L39
[ ... out ... ]

So it blocks the operations up, but still converts one-value-at-a-time.

If you change your original loop to operate on a few elements per iteration:

size_t i;
for (i = 0; i + 4 <= uIntegers.size(); i += 4)
{
    uDoubles[i] = uIntegers[i] / 32768.0;
    uDoubles[i+1] = uIntegers[i+1] / 32768.0;
    uDoubles[i+2] = uIntegers[i+2] / 32768.0;
    uDoubles[i+3] = uIntegers[i+3] / 32768.0;
}
for (; i < uIntegers.size(); i++)
    uDoubles[i] = uIntegers[i] / 32768.0;

the compiler, gcc -msse4.2 -O8 ... (i.e. even without requesting unrolling), identifies the potential to use CVTDQ2PD/MULPD and the core of the loop becomes:

    .p2align 4,,10
    .p2align 3
.L4:
    movdqu  (%rcx), %xmm0
    addq    $16, %rcx
    cvtdq2pd        %xmm0, %xmm1
    pshufd  $238, %xmm0, %xmm0
    mulpd   %xmm2, %xmm1
    cvtdq2pd        %xmm0, %xmm0
    mulpd   %xmm2, %xmm0
    movlpd  %xmm1, (%rdx,%rax,8)
    movhpd  %xmm1, 8(%rdx,%rax,8)
    movlpd  %xmm0, 16(%rdx,%rax,8)
    movhpd  %xmm0, 24(%rdx,%rax,8)
    addq    $4, %rax
    cmpq    %r8, %rax
    jb      .L4
    cmpq    %rdi, %rax
    jae     .L29
[ ... duff's device style for the "tail" ... ]
.L29:
    rep ret

I.e., now the compiler recognizes the opportunity to put two doubles per SSE register and do the multiply / conversion in parallel. This is pretty close to the code that Adam's SSE intrinsics version would generate.

The code in total (I've shown only about 1/6th of it) is much more complex than the "direct" intrinsics version, because, as mentioned, the compiler tries to prepend/append unaligned / not-block-multiple "heads" and "tails" to the loop. Whether this is beneficial largely depends on the average/expected sizes of your vectors; for the "generic" case (vectors more than twice the size of the block processed by the "innermost" loop), it'll help.

The result of this exercise is, largely, that if you coerce (by compiler options/optimization) or hint (by slightly rearranging the code) your compiler into doing the right thing, then for this specific kind of copy/convert loop it comes up with code that's not going to be much behind hand-written intrinsics.

Final experiment ... make the code:

static double c(int x) { return x / 32768.0; }
void Convert(const std::vector<int>& uIntegers, std::vector<double>& uDoubles)
{
    std::transform(uIntegers.begin(), uIntegers.end(), uDoubles.begin(), c);
}

and (for the nicest-to-read assembly output, this time using gcc 4.4 with gcc -O8 -msse4.2 ...) the generated assembly core loop (again, there's a pre/post bit) becomes:

    .p2align 4,,10
    .p2align 3
.L8:
    movdqu  (%r9,%rax), %xmm0
    addq    $1, %rcx
    cvtdq2pd        %xmm0, %xmm1
    pshufd  $238, %xmm0, %xmm0
    mulpd   %xmm2, %xmm1
    cvtdq2pd        %xmm0, %xmm0
    mulpd   %xmm2, %xmm0
    movapd  %xmm1, (%rsi,%rax,2)
    movapd  %xmm0, 16(%rsi,%rax,2)
    addq    $16, %rax
    cmpq    %rcx, %rdi
    ja      .L8
    cmpq    %rbx, %rbp
    leaq    (%r11,%rbx,4), %r11
    leaq    (%rdx,%rbx,8), %rdx
    je      .L10
[ ... ]
.L10:
[ ... ]
    ret

With that, what do we learn? If you want to use C++, really use C++ ;-)
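With C++11 the helper can become a lambda, which gives the compiler the same inlining opportunity without a separate named function:

```cpp
#include <algorithm>
#include <vector>

void Convert(const std::vector<int>& uIntegers, std::vector<double>& uDoubles) {
    // std::transform with an inlineable lambda; vectorizes the same way as the
    // free-function version shown above.
    std::transform(uIntegers.begin(), uIntegers.end(), uDoubles.begin(),
                   [](int x) { return x / 32768.0; });
}
```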

FrankH.
  • What is the `-O8` switch supposed to do in gcc? I thought the maximum optimization level was `-O3`? – Piotr99 Jun 28 '13 at 08:18
  • 1
    In gcc, `-O...` level options beyond the "highest effective" just ... do the same as that (at least gcc 4.x seem fine with `-O9999999999999999999999999999999` ...). Call it habit; older gcc versions at one point refused to allow values above `-O8` ... – FrankH. Jun 30 '13 at 08:02

Let me try another way:

If multiplying is genuinely better at the assembly-instruction level, then this guarantees that a multiplication is used:

void CAudioDataItem::Convert(const vector<int>& uIntegers, vector<double>& uDoubles)
{
    for (size_t i = 0; i < uIntegers.size(); i++)
    {
        uDoubles[i] = uIntegers[i] * 0.000030517578125;
    }
}
ruben2020
  • Since the reciprocal of the original divisor is exactly representable, why would you use a numeral that differs from the exact value and that is rounded again when converted to binary floating-point? Just use `(1 / 32768.)` or `0x1p-15`. – Eric Postpischil Jun 24 '13 at 19:54
  • With `(1/32768)`, could the compiler convert into a division? – ruben2020 Jun 24 '13 at 20:00
  • `1/32768` is zero, due to integer arithmetic, but the compiler may convert `1/32768.` to .000030517578125 at compile-time, particularly since the result is exact and raises no floating-point exceptions. Other expressions, such as `1/3.`, can raise exceptions (such as the exception for an inexact result), so whether a compiler optimizes them at compile-time depends on compiler quality and mode. (Some compilers have switches for more or less conformance to floating-point standards.) – Eric Postpischil Jun 24 '13 at 20:16
  • Another option is to use hexadecimal floating-point notation, which provides a way to specify floating-point values exactly in a way that is easy for software to convert. The numeral `0x1p-15` means 1 multiplied by two to the power of -15. – Eric Postpischil Jun 24 '13 at 20:18
  • ok. Thanks. This is the first time I've heard of the p-notation. – ruben2020 Jun 24 '13 at 20:43
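The hexadecimal-float notation from these comments can be sanity-checked directly (hex float literals are standard in C99 and C++17):

```cpp
// 0x1p-15 means 1 * 2^-15, i.e. exactly 1/32768 -- no decimal rounding involved.
constexpr double kScale = 0x1p-15;
```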