
I am trying to optimize a piece of code without resorting to parallelization or SSE.

The critical code currently runs in about 20 ms on my PC with O2. That seems like quite a bit, even for ~17 million iterations. The particular piece that is too slow is as follows:

    for (int d = 0; d < numDims; d++)
    {
        for (int i = 0; i < numNodes; i++)
        {
            bins[d][(int) (floodVals[d][i] * binSteps)]++;
        }
    }

Update: Changing to iterators reduced the run-time to 17ms.

    for (int d = 0; d < numDims; d++)
    {
        std::vector<float>::iterator floodIt;
        for (floodIt = floodVals[d].begin(); floodIt < floodVals[d].end(); floodIt++)
        {
            bins[d][(int) (*floodIt * binSteps)]++;
        }
    }

The full dummy code is here:

    #include <vector>
    #include <random>
    #include <iostream>
    #include <chrono>

    int main()
    {
        // Initialize random normalized input [0, 1)
        std::random_device rd;
        std::mt19937 gen(rd());
        std::uniform_real_distribution<float> dist(0, 0.99999);

        // Initialize dimensions
        const int numDims = 130;
        const int numNodes = 130000;
        const int binSteps = 30;

        // Make dummy data
        std::vector<std::vector<float>> floodVals(numDims, std::vector<float>(numNodes));

        for (int d = 0; d < numDims; d++)
        {
            for (int i = 0; i < numNodes; i++)
            {
                floodVals[d][i] = dist(gen);
            }
        }

        // Initialize binning
        std::vector<std::vector<int>> bins(numDims, std::vector<int>(binSteps, 0));

        // Time critical section of code
        auto start = std::chrono::high_resolution_clock::now();

        for (int d = 0; d < numDims; d++)
        {
            for (int i = 0; i < numNodes; i++)
            {
                bins[d][(int) (floodVals[d][i] * binSteps)]++;
            }
        }

        auto finish = std::chrono::high_resolution_clock::now();
        std::chrono::duration<double> elapsed = finish - start;
        std::cout << "Elapsed: " << elapsed.count() * 1000 << " ms" << std::endl;

        return 0;
    }

Taylee
    What were the results of your sessions with a profiler? What kind of results did you get when you switched to using iterators instead of subscripting? What were the results when compiling -O3? – Captain Obvlious Jan 04 '23 at 23:31
  • What compiler switches did you use to disable SSE? – Eljay Jan 04 '23 at 23:36
  • Since each vector in both of your vectors-of-vectors has the same length, one thing to try is to replace them with single flat vectors. This removes indirections and possibly has better cache locality. – Eric M Schmidt Jan 04 '23 at 23:42 (see the first sketch after this thread)
  • @CaptainObvlious I would say there is not much to profile here. I did try to remove the multiplication, but that had negligible effect. The expensive line is the only line that's in the loop and there's not much to change about it, but perhaps another way of doing it would be possible. Switching to iterators is an interesting idea. I'll update the question with the results. I'm compiling using cl, which seems to not have an O3 option. – Taylee Jan 04 '23 at 23:43
  • @Eljay I did not use any such switches. – Taylee Jan 04 '23 at 23:46
  • @EricMSchmidt It does seem a little faster, if I can somehow avoid the division needed to find out which set of bins I have to be in: `for (int i = 0; i < floodVals.size(); i++) bins[(i / numNodes) * binSteps + (int) (floodVals[i] * binSteps)]++;` – Taylee Jan 04 '23 at 23:55
  • @harold Yes sorry, I handled it in my actual code, not the dummy code because it was a bit irrelevant. Will change the generator to only generate until 0.9999 in the question. – Taylee Jan 04 '23 at 23:57
  • I tried the usual "unroll using several histograms" trick to reduce dependencies through memory (one of the non-SIMD approaches discussed by Peter [here](https://stackoverflow.com/a/39271518/555045)), but the effect was very small. – harold Jan 05 '23 at 00:03 (see the second sketch after this thread)
    Casting to `long` instead of `int` shaves off one sign-extend instruction. Otherwise gcc's generated code looks fine, with no obvious inefficiencies. – Nate Eldredge Jan 05 '23 at 00:15
  • If you do not specify any particular flag to the compiler, then it will generate SSE2 code on any mainstream x86-64 CPU. In this case, it will typically use the `cvttss2si` instruction. SSE has been the modern way to do float-based computation for at least two decades; the old way is inefficient. By the way, at 3 GHz for example, ~17mil iterations means less than 4.5 cycles per iteration. This is not much, since you need two reads, one store, an increment, a scalar floating-point multiplication and a conversion... – Jérôme Richard Jan 05 '23 at 00:18
  • One thought is to replace your floating-point math with integer math. Instead of populating `floodVals` with floats between 0 and 1, scale it to integers between 0 and some convenient power of 2, let's say `[0..65535]`. Then the array index is computed via `(floodVals[d][i] * binSteps) / 65536`. Instead of a floating-point multiply and a float-to-integer conversion per iteration, you now have an integer multiply and a right shift. It also fixes your bug where you possibly overrun the last element. – Nate Eldredge Jan 05 '23 at 00:19 (see the third sketch after this thread)
  • (Just make sure to choose a power of 2 that will not overflow when multiplied by `binSteps`.) – Nate Eldredge Jan 05 '23 at 00:24
  • @NateEldredge It seems to run just a tad slower than the float version actually. – Taylee Jan 05 '23 at 00:27
    On my i5-9600KF CPU, the loop takes 2.7 cycles/iter with GCC 10.3. A manual unrolling results in 2.5 cycles/iter. This is very good for scalar code (especially regarding the latency of the operations starting from `cvttss2si` and the loads/stores in the L1 cache). If you want something fast, then you need to specifically optimize the code for a given CPU architecture. The code is apparently issue-bound (too many instructions to execute per cycle). – Jérôme Richard Jan 05 '23 at 01:03
    why on earth do you want a non-parallel non-SIMD code? Is it just a challenge? – phuclv Jan 05 '23 at 01:04
    @phuclv: I'm not sure how SIMD would help, until you get to AVX-512 and scatter/gather. Parallelizing would help for sure though, this problem is embarrassingly parallel: just feed each thread a different range of values for `d`. – Nate Eldredge Jan 05 '23 at 01:09
  • @phuclv Well the code is trivially parallelized, so that will happen anyway but is not the point of the question. If it's faster before parallelization I expect it to be even faster when parallelized. As for SIMD, it might be a thing to resort to if it ends up not being fast enough with optimizations and parallelization, but then I end up in a hellhole of CPU compatibility. – Taylee Jan 05 '23 at 01:11
  • SIMD has been [successfully used for similar tasks](http://fastcompression.blogspot.com/2014/09/counting-bytes-fast-little-trick-from.html) (at the end, Nathan Kurz's version) even before AVX512, though not in an obvious manner (just to reduce the number of loads really, replacing a bunch of them with a vector load and extract-from-vector instructions). – harold Jan 05 '23 at 01:17
  • There are different AVX512 approaches than scatter/gather/`vpconflictd` too, for example one that uses `vgf2p8affineqb` for a bit-level transpose and then uses bit-sliced shifts and `vpopcntq` to count how many ones end up in each bin. – harold Jan 05 '23 at 01:29
  • This line has you jumping, randomly, all over the place in memory: `bins[d][(int) (floodVals[d][i] * binSteps)]++;`. If you want code to run as fast as possible, your memory access needs to be predictable. – Taekahn Jan 12 '23 at 15:09
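
The flat-layout idea from Eric M Schmidt's comment might be sketched roughly as below, keeping the per-dimension outer loop so no division is needed to recover the right row of bins. This is only an illustration under the same `numDims`/`numNodes`/`binSteps` setup; the names `floodFlat`, `binsFlat`, `vals`, and `row` are invented here:

    // Flat layout: dimension d occupies [d * numNodes, (d + 1) * numNodes)
    std::vector<float> floodFlat(numDims * numNodes);
    std::vector<int> binsFlat(numDims * binSteps, 0);

    // ... fill floodFlat the same way floodVals was filled ...

    for (int d = 0; d < numDims; d++)
    {
        const float* vals = &floodFlat[d * numNodes];  // this dimension's values
        int* row = &binsFlat[d * binSteps];            // this dimension's bins
        for (int i = 0; i < numNodes; i++)
        {
            row[(int) (vals[i] * binSteps)]++;
        }
    }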
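
harold's "several histograms" trick could look like the following rough sketch: two private counter arrays break the store-to-load dependency on a single hot counter and are merged at the end. He reported only a very small effect here; `h0` and `h1` are my names:

    for (int d = 0; d < numDims; d++)
    {
        std::vector<int> h0(binSteps, 0), h1(binSteps, 0);
        int i = 0;
        for (; i + 1 < numNodes; i += 2)    // two independent histograms
        {
            h0[(int) (floodVals[d][i]     * binSteps)]++;
            h1[(int) (floodVals[d][i + 1] * binSteps)]++;
        }
        for (; i < numNodes; i++)           // odd-length tail
        {
            h0[(int) (floodVals[d][i] * binSteps)]++;
        }
        for (int b = 0; b < binSteps; b++)  // merge the partial histograms
        {
            bins[d][b] += h0[b] + h1[b];
        }
    }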
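
Finally, Nate Eldredge's integer variant might be sketched as follows, assuming the data is generated directly as integers in [0, 65535] (requires `<cstdint>`; `floodInts` is an invented name, and the OP reported this ran a tad slower than the float version):

    std::vector<std::vector<uint16_t>> floodInts(numDims, std::vector<uint16_t>(numNodes));
    // ... fill with std::uniform_int_distribution<int>(0, 65535) ...

    for (int d = 0; d < numDims; d++)
    {
        for (int i = 0; i < numNodes; i++)
        {
            // Integer multiply + shift replaces the float multiply and conversion;
            // 65535 * binSteps fits comfortably in an int, so no overflow.
            bins[d][(floodInts[d][i] * binSteps) >> 16]++;
        }
    }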

1 Answer


Try eliminating the indexing on `d` in the inner loop, since `d` is constant in the inner loop anyway. This was roughly 2x faster for me.

    for (int d = 0; d < numDims; d++)
    {
        int* const bins_d = &bins[d][0];
        float* const floodVals_d = &floodVals[d][0];
        for (int i = 0; i < numNodes; i++)
        {
            bins_d[(int) (floodVals_d[i] * binSteps)]++;
        }
    }

Wyck
    What compiler and optimization options was this with? With gcc 12.2 `-O2`, it looked to me like it was already doing that optimization. Though that was from reading the generated assembly rather than testing it. – Nate Eldredge Jan 05 '23 at 01:03
  • It indeed seems to run in about 17ms on VC19. Interestingly it's faster than the iterator code above. – Taylee Jan 05 '23 at 01:07
  • Both GCC 10.3 and Clang 15.0 generate the same loop assembly code as the OP's using only the compilation flag `-O2`. It would be surprising for a compiler not to do such a basic optimization... However, MSVC 19.33 generates really bad assembly code for the initial OP code compared to this one using `/O2` (it reloads the same values from memory to registers in the hot loop, which is insane/stupid). With this code, MSVC generates code similar to the two other compilers. – Jérôme Richard Jan 05 '23 at 01:14
  • After removing a useless flag that had sneaked in, it runs in 17ms compared to 20ms of the original code, but the iterator code also runs at 17ms. – Taylee Jan 05 '23 at 01:23
    @JérômeRichard I think the compiler needs to assume strict aliasing rules to be able to do this optimization. MSVC, AFAIK, does not implement that. So writing into `bins` may change the data pointer in the vector objects -- thus MSVC reloads them on each iteration. – Yakov Galka Jan 05 '23 at 01:36