I don't understand why almost the same code works faster

Question

I decided to brush up on my memory of sorting algorithms and write a small C++ library containing an implementation of every well-known algorithm. When it came to selection sort, I implemented it like this:

template<class RandomIt, class Compare>
    void selection(RandomIt first, RandomIt last, Compare comp) {
        for (auto it = first; it != last; ++it)
            std::iter_swap(it, std::min_element(it, last, comp));
    }

For the sake of interest, I began to compare the speed of my algorithm with other implementations, which I found on the Internet and found out that the implementation below is more than twice as fast:

void selectionsort_1(int* l, int* r) {
    for (int* i = l; i < r; i++) {
        int minz = *i, *ind = i;
        for (int* j = i + 1; j < r; j++) {
            if (*j < minz) minz = *j, ind = j;
        }
        std::iter_swap(i, ind);
    }
}

I tried to find the reason for this on the internet, but without success. Later, I noticed that storing the value of the minimum element and a pointer to it does not make much sense and changed the code a bit:

void selectionsort_2(int* l, int* r) {
    for (int* i = l; i < r; i++) {
        int* min = i;
        for (int* j = i + 1; j < r; j++) {
            if (*j < *min) min = j;
        }
        std::iter_swap(i, min);
    }
}

To my surprise, the modified algorithm ran in the same amount of time as my implementation, i.e. worse. I decided to see what assembly code the compiler generates for each of the two previous code options and did not notice much difference.

selectionsort_1

selectionsort_2

Perhaps I came to the wrong conclusion, but is it really a difference in one mov command that can slow down or speed up the code twice?

P.S. Below I attached an example of test results on an array of random integer numbers with a dimension of 100,000 elements. Project configuration: Release.

Remember to enable your compilers optimizer for even greater speedups. — Jesper Juhl, Nov 17 '22 at 23:12
Please provide a complete [mre] that we can compile ourselves (including how you measure the time taken) — UnholySheep, Nov 17 '22 at 23:12
Where is code doing measurements? Are you aware that performance measurement is hard? Is is full of traps, since is AS IF rule can transform code is such way that you do not measure thing you wish to measure. — Marek R, Nov 17 '22 at 23:12
https://meta.stackoverflow.com/questions/285551/why-not-upload-images-of-code-errors-when-asking-a-question — Marek R, Nov 17 '22 at 23:14
@JesperJuhl thanks for the answer! However, the optimizer was enabled from the start — kanashi, Nov 17 '22 at 23:20
One possible reason can be that derefencing the `min` pointer can cause a cache miss if the pointed to element is far enough away from what `j` points to at the moment. Though given enough iterations the cost of the pointer dereferencing could also become a factor — UnholySheep, Nov 17 '22 at 23:20
You say it's release but those codes do an awful lot of local variable accessing on the stack for an optimized build. — Jester, Nov 17 '22 at 23:21
What is the exact command line used to compile the code? And which compiler, on what platform? — Eljay, Nov 17 '22 at 23:29
Can't reproduce [clang](https://quick-bench.com/q/ZnAqwU9vTK-4E14nMk9jySnZLeE) [gcc](https://quick-bench.com/q/RUb1sT31x5c5jtYx-zjjpXFQr4M). — Marek R, Nov 17 '22 at 23:36
@MarekR I use Microsoft Visual C++ compiler. Language standard is C++20 — kanashi, Nov 17 '22 at 23:41
I cannot reproduce your results on my local Windows machine either, `selectionsort_2` is slower, but `sort::selection` is almost 6 times faster than `selectionsort_1` when I run in release mode (running on the latest update of MSVC2022) — UnholySheep, Nov 17 '22 at 23:49
What is the exact command line used to compile the code? And which version of Microsoft Visual C++ compiler? On Windows 32-bit, or 64-bit? — Eljay, Nov 18 '22 at 00:25
@UnholySheep Before this I was using MSVC2019. Now installed the latest version of MSVC2022, and everything works as you wrote, thanks! The reason for the difference in the execution time of the functions seems to be that you indicated earlier — kanashi, Nov 18 '22 at 00:31
@Eljay Command line: /permissive- /ifcOutput "Release\" /GS /GL /analyze- /W3 /Gy /Zc:wchar_t /Zi /Gm- /O2 /sdl /Fd"Release\vc143.pdb" /Zc:inline /fp:precise /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /errorReport:prompt /WX- /Zc:forScope /Gd /Oy- /Oi /MD /std:c++20 /FC /Fa"Release\" /EHsc /nologo /Fo"Release\" /Ot /Fp"Release\test.pch" /diagnostics:column Version: now MSVC2022 Windows 32-bit — kanashi, Nov 18 '22 at 00:39
@Eljay After replacing the compiler with MSVC2022, the algorithm with iterators really began to work much faster, however, selection _1 and selection _2 still have different running times. I also tried the assembly under Windows 32-bit and 64-bit — kanashi, Nov 18 '22 at 00:40
Welcome to SO! Please read the [ask] help page. For one thing, I think the title here could use improvement. — starball, Nov 18 '22 at 01:48
Your screenshots of disassembly are from debug builds. They're keeping local pointer variables in stack memory, so `*j` is `mov eax, [ebp - 2Ch]` / `mov ecx, [eax]`, for example. I don't know what all the MSVC options mean, but `/O2`, or `/Ox` enable optimization. The version using STL functions being slowest is what you'd expect in a non-optimized build, since the functions wouldn't inline. But your question doesn't say anything about compilers or versions or options; [edit] to update with info from comments. — Peter Cordes, Nov 18 '22 at 10:41
See also [Idiomatic way of performance evaluation?](https://stackoverflow.com/q/60291987). And to some degree [What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?](https://stackoverflow.com/q/51607391) but that's mostly about less branchy code, other than the fact that It's Complicated. — Peter Cordes, Nov 18 '22 at 10:43

I don't understand why almost the same code works faster

0 Answers0