
I always use double for calculations, but double offers far more accuracy than I need (or than makes sense, given that most of the calculations I do are approximations to begin with).

But since the processor is already 64-bit, I do not expect that using a type with fewer bits will be of any benefit.

Am I right or wrong? How would I optimize for speed? (I understand that smaller types are more memory efficient.)

Here is the test:

#include <cmath>
#include <ctime>
#include <cstdio>

template<typename T>
void creatematrix(int m, int n, T **&M){
    // allocate one contiguous m*n block, plus an array of row pointers into it
    M = new T*[m];
    T *M_data = new T[m*n];

    for(int i = 0; i < m; ++i)
    {
        M[i] = M_data + i * n;
    }
}

int main(){
    clock_t start, end;
    double diffs;
    const int N = 4096;
    const int rep = 8;

    float **m1, **m2;
    creatematrix(N,N,m1); creatematrix(N,N,m2);

    // fill with non-negative values so sqrt() gets valid input
    for(int i = 0; i < N; i++)
        for(int j = 0; j < N; j++)
            m1[i][j] = m2[i][j] = 1.0f;

    start=clock();
    for(int k = 0;k<rep;k++){
        for(int i = 0;i<N;i++){
            for(int j =0;j<N;j++)
                m1[i][j]=sqrt(m1[i][j]*m2[i][j]+0.1586);
        }
    }
    end = clock();
    diffs = (end - start)/(double)CLOCKS_PER_SEC;
    printf("time = %lf\n",diffs);


    delete[] m1[0];
    delete[] m1;

    delete[] m2[0];
    delete[] m2;

    getchar();
    return 0;
}

There was no time difference between double and float; however, when the square root is not used, float is twice as fast.

Paul R
Gappy Hilmore
  • Operations like `/` and `sqrt` are typically much slower for `double` than for `float` - check the throughput/latency for the operations that you are interested in on your target CPU. – Paul R Sep 01 '15 at 17:50
  • It matters when you use a lot of them: cache utilization is better, and vectorized code can perform better as well. Tinker with the math model only when you know what you're doing; it is very rarely an arbitrary choice. – Hans Passant Sep 01 '15 at 18:00
  • @PaulR I updated my post and did a test. Double and float had almost no time difference. – Gappy Hilmore Sep 04 '15 at 16:19
  • I just found out that the float version of sqrt is sqrtf; using it caused a 20% decrease in time. – Gappy Hilmore Sep 04 '15 at 16:24
  • If you put a division operation in the loop you may see a bigger difference, e.g. change `+0.1586` to `/0.1586` (or `/0.1586f` for the `float` version). – Paul R Sep 04 '15 at 16:26

1 Answer


There are a few ways float operations can be faster than double:

  • Faster I/O: you have only half the bits to move between disk/memory/cache/registers
  • Faster division and square root: these are typically the only arithmetic operations that are slower in double. As an example, on a Haswell a DIVSS (float division) takes 7 clock cycles, whereas a DIVSD (double division) takes 8-14 (source: Agner Fog's instruction tables).
  • If you can take advantage of SIMD instructions, then you can process twice as many elements per instruction (i.e. in a 128-bit SSE register, you can operate on 4 floats, but only 2 doubles).
  • Special functions (log, sin) can use lower-degree polynomials: e.g. the openlibm implementation of log uses a degree 7 polynomial, whereas logf only needs degree 4.
  • If you need higher intermediate precision, you can simply promote float to double, whereas for a double you need either software double-double arithmetic or the slower long double.

Note that these points hold on 32-bit architectures as well: unlike with integers, there's nothing particularly special about having the size of the format match your architecture's word size, i.e. on most machines doubles are just as "native" as floats.

Simon Byrne