
I have a complex set of template functions which do calculations in a loop, combining floating-point numbers with the uint32_t loop indices. I was surprised to observe that, for this kind of function, my test code runs faster with double-precision floating-point numbers than with single-precision ones.

As a test, I changed the type of my indices to uint16_t. After this, both the double and the float version of the program were faster (as expected), but now the float version was significantly faster than the double version. I also tested the program with uint64_t indices; in this case the double and the float version are equally slow.

I imagine this is because a uint32_t fits into the mantissa of a double but not into that of a float. Once the index type was reduced to uint16_t, the indices also fit into the mantissa of a float and the conversion should be trivial. In the case of uint64_t, the conversion to double also needs rounding, which would explain why both versions perform equally.
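
Just to illustrate the rounding I have in mind (this snippet is not part of my code, and the variable names are only placeholders):

#include <cstdint>
#include <iomanip>
#include <iostream>

int main()
{
    // float has a 24-bit significand, double a 53-bit one.
    std::uint32_t bigIndex = 4294967295u; // needs 32 bits
    std::uint16_t smallIndex = 65535u;    // needs 16 bits

    std::cout << std::setprecision(17);
    std::cout << static_cast<double>(bigIndex) << '\n';                       // 4294967295 (exact)
    std::cout << static_cast<double>(static_cast<float>(bigIndex)) << '\n';   // 4294967296 (rounded)
    std::cout << static_cast<double>(static_cast<float>(smallIndex)) << '\n'; // 65535 (exact)
    return 0;
}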

Can anybody confirm this explanation?

EDIT: With int or long as the index type, the program runs as fast as with uint16_t. I guess this speaks against my first suspicion.

EDIT: I compiled the program for Windows on an x86 architecture.

EDIT: Here is a piece of code that reproduces the effect of double being faster than float for uint32_t indices, and of both cases being equally fast for int. Please do not comment on the usefulness of this code; it is a stripped-down fragment of my code which reproduces the effect but does nothing sensible.

The main file:

#include "stdafx.h"

typedef short spectraType;
typedef int intermediateValue;
typedef double returnType;

#include "Preprocess_t.h"
#include "Windows.h"
#include <iostream>

int main()
{
    const size_t numberOfBins = 10000;
    const size_t numberOfSpectra = 500;
    const size_t peakWidth = 25;

    bool startPeak = false;
    short peakHeight;

    Preprocess<short, returnType> myPreprocessor;
    std::vector<returnType> processedSpectrum;

    std::vector<std::vector<short>> spectra(numberOfSpectra, std::vector<short>(numberOfBins));


    std::vector<float> peakShape(peakWidth);

    LARGE_INTEGER freq, start, stop;
    double time_ms;

    QueryPerformanceFrequency(&freq);


    for (size_t i = 0; i < peakWidth; ++i)
    {
        peakShape[i] = static_cast<float>(exp(-(i - peakWidth / 2.0) *(i - peakWidth / 2.0) / 10.0));
    }


    for (size_t i = 0; i < numberOfSpectra; ++i)
    {
        size_t j = 0;
        for (; j < 200; ++j)
        {
            spectra[i][j] = rand() % 100;
        }

        for (size_t k = 0; k < 25; ++k)
        {
            spectra[i][j] = static_cast<short>(16383 * peakShape[k]);
            j++;
        }

        for (; j < numberOfBins; ++j)
        {
            startPeak = !static_cast<bool>(abs(rand()) % (numberOfBins / 4));

            if (startPeak)
            {
                peakHeight = rand() % 16384;

                for (size_t k = 0; k < 25 && j< numberOfBins; ++k)
                {
                    spectra[i][j] = static_cast<short>(peakHeight * peakShape[k] + rand() % 100);
                    j++;
                }
            }
            else
            {
                spectra[i][j] = rand() % 100;
            }
        }

        for (j = 0; j < numberOfBins; ++j)
        {
            double temp = 1000.0*exp(-(static_cast<float>(j) / (numberOfBins / 3.0)))*sin(static_cast<float>(j) / (numberOfBins / 10.0));
            spectra[i][j] -= static_cast<short>(temp);
        }

    }

    // This is where the critical code is called

    QueryPerformanceCounter(&start);

    for (size_t i = 0; i < numberOfSpectra; ++i)
    {
        myPreprocessor.SetSpectrum(&spectra[i], 1000, &processedSpectrum);
        myPreprocessor.CorrectBaseline(30, 2.0);
    }

    QueryPerformanceCounter(&stop);

    time_ms = 1000.0 * static_cast<double>(stop.QuadPart - start.QuadPart) / static_cast<double>(freq.QuadPart); // convert ticks to milliseconds

    std::cout << "time spent preprocessing: " << time_ms << " ms" << std::endl;

    std::cin.ignore();

    return 0;
}

And the included header Preprocess_t.h:

#pragma once

#include <vector>

//typedef unsigned int indexType;
typedef unsigned short indexType;

template<typename T, typename Out_Type>
class Preprocess
{
public:
    Preprocess() :threshold(1), sdev(1), laserPeakThreshold(500), a(0), b(0), firstPointUsedAfterLaserPeak(0) {};
    ~Preprocess() {};

    void SetSpectrum(std::vector<T>* input, T laserPeakThreshold, std::vector<Out_Type>* processedSpectrum); ///@note We need the laserPeakThreshold parameter for the baseline correction, not only for the shift.

    void CorrectBaseline(indexType numberOfPoints, Out_Type thresholdFactor);

private:

    void LinFitValues(indexType beginPoint);
    Out_Type SumOfSquareDiffs(Out_Type x, indexType n);

    Out_Type LinResidualSumOfSquareDist(indexType beginPoint);

    std::vector<T>* input;
    std::vector<Out_Type>* processedSpectrum;

    std::vector<indexType> fitWave_X;
    std::vector<Out_Type> fitWave;
    Out_Type threshold;
    Out_Type sdev;
    T laserPeakThreshold;
    Out_Type a, b;
    indexType firstPointUsedAfterLaserPeak;
    indexType numberOfPoints;
};

template<typename T, typename Out_Type>
void Preprocess<T, Out_Type>::CorrectBaseline(indexType numberOfPoints, Out_Type thresholdFactor)
{
    this->numberOfPoints = numberOfPoints;

    indexType numberOfBins = static_cast<indexType>(input->size());
    indexType firstPointUsedAfterLaserPeak = 0;

    indexType positionInFitWave = 0;

    positionInFitWave = firstPointUsedAfterLaserPeak;

    for (indexType i = firstPointUsedAfterLaserPeak; i < numberOfBins - numberOfPoints; i++) {
        LinFitValues(positionInFitWave);
        processedSpectrum->at(i + numberOfPoints) = input->at(i + numberOfPoints) - static_cast<Out_Type>(a + b*(i + numberOfPoints));


        positionInFitWave++;
        fitWave[positionInFitWave + numberOfPoints - 1] = input->at(i + numberOfPoints - 1);
        fitWave_X[positionInFitWave + numberOfPoints - 1] = i + numberOfPoints - 1;

    }
}

template<typename T, typename Out_Type>
void Preprocess<T, Out_Type>::LinFitValues(indexType beginPoint)
{
    Out_Type y_mean, x_mean, SSxy, SSxx, normFactor;
    y_mean = x_mean = SSxy = SSxx = normFactor = static_cast<Out_Type>(0);
    indexType endPoint = beginPoint + numberOfPoints;
    Out_Type temp;

    if ((fitWave_X[endPoint - 1] - fitWave_X[beginPoint]) == numberOfPoints)
    {
        x_mean = (fitWave_X[endPoint - 1] - fitWave_X[beginPoint]) / static_cast<Out_Type>(2);

        for (indexType i = beginPoint; i < endPoint; i++) {
            y_mean += fitWave[i];
        }

        y_mean /= numberOfPoints;

        SSxx = SumOfSquareDiffs(x_mean, fitWave_X[endPoint - 1]) - SumOfSquareDiffs(x_mean, fitWave_X[beginPoint]);

        for (indexType i = beginPoint; i < endPoint; i++)
        {
            SSxy += (fitWave_X[i] - x_mean)*(fitWave[i] - y_mean);
        }
    }
    else
    {
        for (indexType i = beginPoint; i < endPoint; i++) {
            y_mean += fitWave[i];
            x_mean += fitWave_X[i];
        }

        y_mean /= numberOfPoints;
        x_mean /= numberOfPoints;

        for (indexType i = beginPoint; i < endPoint; i++)
        {
            temp = (fitWave_X[i] - x_mean);
            SSxy += temp*(fitWave[i] - y_mean);
            SSxx += temp*temp;
        }
    }

    b = SSxy / SSxx;
    a = y_mean - b*x_mean;
}

template<typename T, typename Out_Type>
inline Out_Type Preprocess<T, Out_Type>::SumOfSquareDiffs(Out_Type x, indexType n)
{
    // Closed form for the sum of (x + i)^2 over i = 0..n-1.
    // Note: (n - 1)*n*(2*n - 1) is evaluated in integer arithmetic and can overflow for large n.
    return n*x*x + n*(n - 1)*x + ((n - 1)*n*(2 * n - 1)) / static_cast<Out_Type>(6);
}

template<typename T, typename Out_Type>
Out_Type Preprocess<T, Out_Type>::LinResidualSumOfSquareDist(indexType beginPoint) 
{
    Out_Type sumOfSquares = 0;
    Out_Type temp;

    for (indexType i = 0; i < numberOfPoints; ++i) {
        temp = fitWave[i + beginPoint] - (a + b*fitWave_X[i + beginPoint]);
        sumOfSquares += temp*temp;
    }
    return sumOfSquares;
}


template<typename T, typename Out_Type>
inline void Preprocess<T, Out_Type>::SetSpectrum(std::vector<T>* input, T laserPeakThreshold, std::vector<Out_Type>* processedSpectrum)
{
    this->input = input;
    fitWave_X.resize(input->size());
    fitWave.resize(input->size());
    this->laserPeakThreshold = laserPeakThreshold;
    this->processedSpectrum = processedSpectrum;
    processedSpectrum->resize(input->size());
}
Paul R.
  • This is very much architecture-dependent, but yes: if your float_type's mantissa equals sizeof( int_type ), you might expect speedup on *some* architectures. – lorro Jun 22 '16 at 13:36
  • Try it with `int` for the loop index. – Pete Becker Jun 22 '16 at 13:36
  • what compiler flags are you using? optimization turned on? – 463035818_is_not_an_ai Jun 22 '16 at 13:37
  • You can get quite a bit of insight by examining the generated assembly. Why don't you share the code? Perhaps someone might go to the trouble of doing that – Smeeheey Jun 22 '16 at 13:39
  • @PeteBecker Ok, that version also results in the float version being faster. That speaks against my reasoning. – Paul R. Jun 22 '16 at 13:39
  • @tobi303 /O2 on VC2015; all other flags are the defaults of Visual Studio 2015. – Paul R. Jun 22 '16 at 13:41
  • @Smeeheey I managed to produce a (still large for SO) piece of code that reproduces the effect. – Paul R. Jun 22 '16 at 14:30
  • Arrgh, windows. Sigh... :( – Smeeheey Jun 22 '16 at 14:38
  • @Smeeheey Sorry for that. Nevertheless, the code is portable except for the QueryPerformanceCounter stuff used for the timing. The #pragma once could be changed to a header guard and #include "stdafx.h" can be removed if not using Visual Studio. – Paul R. Jun 22 '16 at 14:57
  • Yeah that's OK. I've converted it to Linux already – Smeeheey Jun 22 '16 at 14:58
  • Side note: Do not use vector::at in performance-critical sections (I personally never use vector::at). – Jun 22 '16 at 15:32
  • @DieterLücking You're right, I never used it before and didn't notice that it checks for out of bounds indices. – Paul R. Jun 24 '16 at 07:07

1 Answer


You are using MSVC? I saw a similar effect when I implemented code that was essentially a matrix multiplication plus a vector addition. There, I expected floats to be faster because they can be SIMD-parallelized better, since more of them fit into the SSE registers. However, using doubles was much faster.
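
To illustrate what I mean by packing: with 128-bit SSE registers, one addition processes four floats but only two doubles. A minimal sketch with SSE intrinsics (not my actual matrix code; the function names are just placeholders):

#include <emmintrin.h> // SSE2 intrinsics

// One 128-bit register holds four floats, added in a single instruction.
void add4(const float* a, const float* b, float* out)
{
    _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}

// The same register width holds only two doubles.
void add2(const double* a, const double* b, double* out)
{
    _mm_storeu_pd(out, _mm_add_pd(_mm_loadu_pd(a), _mm_loadu_pd(b)));
}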

After some investigation, I figured out from the assembler code that the floats need a conversion from the internal FPU precision, and this rounding was consuming most of the runtime. You can change the floating-point model to something faster at the cost of reduced precision. There is also some discussion of this in older threads here on SO.
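
With MSVC the floating-point model is selected with the /fp switch; /fp:precise is the default, and /fp:fast trades strict IEEE behaviour for speed. A minimal command line as a sketch (the file name is only a placeholder):

cl /O2 /fp:fast main.cpp

In the IDE the same setting is under C/C++ > Code Generation > Floating Point Model.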

Jens
  • I understand that floats are not necessarily faster than doubles. What confuses me is the gain in execution speed of about 30% just by changing the type of the indices from unsigned int to int, long, or unsigned short. – Paul R. Jun 23 '16 at 07:38