6

I was experimenting with different integer types in a Visual Studio project on Windows, using the simple exchange sort algorithm below. The processor is Intel. The code was compiled in Release x64 with the optimization setting "Maximize Speed (/O2)". The command line corresponding to these compilation settings is:

/permissive- /GS /GL /W3 /Gy /Zc:wchar_t /Zi /Gm- /O2 /sdl /Fd"x64\Release\vc141.pdb" /Zc:inline /fp:precise /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /errorReport:prompt /WX- /Zc:forScope /Gd /Oi /MD /Fa"x64\Release\" /EHsc /nologo /Fo"x64\Release\" /Fp"x64\Release\SpeedTestForIntegerTypes.pch" /diagnostics:classic 

The code itself:

#include <ctime>
#include <vector>
#include <iostream>

void sort(int N, int A[], int WorkArray[]) // exchange sort
{
    int i, j, index, val_min;
    for (j = 0; j < N; j++)
    {
        val_min = 500000;        // sentinel: larger than any element of A
        for (i = j; i < N; i++)  // find the minimum of A[j..N-1] and its index
        {
            if (A[i] < val_min)
            {
                val_min = A[i];
                index = i;
            }
        }
        WorkArray[j] = A[j];
        A[j] = val_min;
        A[index] = WorkArray[j];
    }
}

int main()
{
    std::vector<int> A(400000), WorkArray(400000);
    for(size_t k = 0; k < 400000; k++)
        A[k] = 400000 - (k+1);

    clock_t begin = clock();

    sort(400000, &A[0], &WorkArray[0]);

    clock_t end = clock();
    double sortTime = double(end - begin) / CLOCKS_PER_SEC;
    std::cout << "Sort time: " << sortTime << std::endl;
    return 0;
}

WorkArray is only needed to save the vector before sorting. The point is that this sort took 22.3 seconds to complete. The interesting part is that if I change the type from int to size_t for the arrays A and WorkArray (both in std::vector and in the argument list of the sort function), as well as for val_min, the time increases to 67.4 seconds, roughly three times slower! The new code is below:

#include <ctime>
#include <vector>
#include <iostream>

void sort(int N, size_t A[], size_t WorkArray[]) // exchange sort
{
    int i, j, index;
    size_t val_min;
    for (j = 0; j < N; j++)
    {
        val_min = 500000U;
        for (i = j; i < N; i++)
        {
            if (A[i] < val_min)
            {
                val_min = A[i];
                index = i;
            }
        }
        WorkArray[j] = A[j];
        A[j] = val_min;
        A[index] = WorkArray[j];
    }
}

int main()
{
    std::vector<size_t> A(400000), WorkArray(400000);
    for(size_t k = 0; k < 400000; k++)
        A[k] = 400000 - (k+1);

    clock_t begin = clock();

    sort(400000, &A[0], &WorkArray[0]);

    clock_t end = clock();
    double sortTime = double(end - begin) / CLOCKS_PER_SEC;
    std::cout << "Sort time: " << sortTime << std::endl;
    return 0;
}

Note that I still keep type int for the local variables i, j, index, N, so the only two arithmetic operations, i++ and j++, should take the same amount of time in both cases. Therefore, the slowdown must have some other cause. Is it related to memory alignment, register sizes, or something else?

But the most outrageous part came when I changed int to unsigned int. Both unsigned int and int occupy the same number of bytes, namely 4 (sizeof confirmed that), yet the runtime for unsigned int was 65.8 s! While the first outcome was somewhat acceptable, the second one completely confuses me. Why is there such a significant difference in the time it takes to run such a simple algorithm, one that does not even involve sign checks?

Thanks to everyone addressing both of these questions. Where can I start reading more about these hardware-level optimization peculiarities? I don't care about the sorting algorithm itself; it's here only to illustrate the problem.

UPDATE: once again, I stress the fact that I use ints for array indices in all three cases.

MajinSaha
  • When you change the types, do you change it everywhere or only in main? Could you post 2 versions of the code which behave differently to answer my question? – Benjamin Barrois Mar 16 '18 at 17:34
  • Print out the assembly language for the different cases. Are there any significant changes? – Thomas Matthews Mar 16 '18 at 17:35
  • What are your optimization settings? In general, benchmarking should only be applied to released, non-debug, builds. – Thomas Matthews Mar 16 '18 at 17:35
  • Something doesn't smell right. The `unsigned int` and `size_t` are the same size on many platforms. Loading and storing use the same instructions regardless of whether the data type is signed or unsigned. However, the compiler may need to perform a signed or unsigned conversion when loading constants. Add the "U" suffixes to your constants for unsigned variables (including `size_t`). – Thomas Matthews Mar 16 '18 at 17:38
  • No repro on GCC via Coliru. Also, there's a warning: `main.cpp:23:16: warning: 'index' may be used uninitialized in this function [-Wmaybe-uninitialized] A[index] = WorkArray[j];` Unfortunately, I don't have MSVC in front of me to test this atm. – Mysticial Mar 16 '18 at 17:38
  • @ThomasMatthews - unsigned int is 32-bit on x64 MSVC and size_t is 64-bit – zzxyz Mar 16 '18 at 17:38
  • @zzxyz: `unsigned int` and `size_t` are 32-bit on ARM Cortex A8. – Thomas Matthews Mar 16 '18 at 17:40
  • @ThomasMatthews - Since he didn't mention clang or Android, and did mention Visual Studio, it's probably safe to guess he's targetting Windows (and Intel) – zzxyz Mar 16 '18 at 17:41
  • @BenjaminBarrois edited. – MajinSaha Mar 16 '18 at 17:42
  • @ThomasMatthews sorry, don't know how to print assembly for C++ code yet. – MajinSaha Mar 16 '18 at 17:43
  • @ThomasMatthews Signed vs. unsigned *can* matter since signed will allow the compiler to do certain optimizations that assume no integer overflow (see the sketch after these comments). But I cannot determine if that would be the case here. (MSVC isn't known to make these types of optimizations anyway.) – Mysticial Mar 16 '18 at 17:47
  • @ThomasMatthews I added "U" symbol, there was no difference in the result, still too slow. – MajinSaha Mar 16 '18 at 17:57
  • Tested the code on my machine. 42s for int, 41s for unsigned int. Was your machine otherwise idle when you tested both scenarios? (I'm uninterested in size_t as that's kind of a "duh it's twice as much data" thing to me.) – zzxyz Mar 16 '18 at 17:58
  • Attempted to Godbolt it; no `main` to prevent inlining. Warning: I don't use MSVC, so don't trust my flags. `size_t`: https://godbolt.org/g/EqS4D9 `int`: https://godbolt.org/g/LyJrcU – Stephen Newell Mar 16 '18 at 17:59
  • @zzxyz I did this testing around ten times, and around x3 times difference was always observed on my machine. :( – MajinSaha Mar 16 '18 at 18:01
  • @zzxyz: that's what I'd have expected too. OP, are you sure you didn't forget to replace an "int" by an "unsigned int" somewhere in your original code which forces a lot of conversions? – Benjamin Barrois Mar 16 '18 at 18:01
  • Not reproducible. Perfectly identical times for `int` and `unsigned int` in VS2017. Moreover, `22.3 seconds` for `int` version is unrealistic. It should take longer. I suspect that `int` test was botched. – AnT stands with Russia Mar 16 '18 at 18:02
  • @MajinSaha: Have you compiled in debug or release mode? WIth any optimization flag? – Benjamin Barrois Mar 16 '18 at 18:02
  • @StephenNewell That's pretty telling. MSVC is vectorizing with `size_t` but not `int`. But `int` vs. `unsigned` are almost identical. – Mysticial Mar 16 '18 at 18:04
  • @MajinSaha - Go to project properties, make sure Release x64 is selected, go to C/C++->Command Line. Paste that command line into your question. – zzxyz Mar 16 '18 at 18:05
  • @zzxyz posted the command line – MajinSaha Mar 16 '18 at 18:07
  • @AnT I have a powerful office machine, 16 cores (though that shouldn't matter in this case?) – MajinSaha Mar 16 '18 at 18:09
  • @MajinSaha - Your disassembly looks completely unlike mine. If you set a breakpoint and hit F5, you can right click on the code and "go to disassembly". That's provided you have basic debug info, which VS has even for Release builds by default. – zzxyz Mar 16 '18 at 18:36
  • @zzxyz it is different from yours because I commented out "seed" function, since it is not involved in time measurement during runtime, it is just used for initialization of the array. I'll try to post the disassembly directly from VS. – MajinSaha Mar 16 '18 at 18:48
  • @MajinSaha - No, it's *entirely* different. Almost unrecognizable. – zzxyz Mar 16 '18 at 18:50
  • @zzxyz The disassembly looks terrifying in VS. I'm afraid this may cause unwanted confusion. It's better if readers generate it locally on their VS. Some readers managed to reproduce the described behavior. – MajinSaha Mar 16 '18 at 18:53
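
A minimal sketch of the signed-overflow point Mysticial raises above (a hypothetical example, not taken from the thread; whether a compiler actually performs the folding varies, and as noted, MSVC is less aggressive about it than GCC/Clang):

// Because signed overflow is undefined behaviour, a compiler is allowed to
// assume that i + 1 > i always holds and may fold the first function to
// 'return true;'. Unsigned arithmetic wraps, so the second function must
// keep the real comparison (it is false when u == UINT_MAX).
bool gt_next_signed(int i)        { return i + 1 > i; }
bool gt_next_unsigned(unsigned u) { return u + 1 > u; }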

2 Answers

10

Inspecting the generated assembly for all 3 variants (int, unsigned, size_t), the big difference is that in the int case the loop in the sort function is unrolled and uses SSE instructions (working on 8 ints at a time), while in the other 2 cases it does neither. Interestingly enough, the sort function is called in the int case, while it is inlined into main in the other two (likely because the loop unrolling makes the function too large to inline).

I'm compiling from the command line using cl /nologo /W4 /MD /EHsc /Zi /Ox, using dumpbin to get the disassembly, with toolset Microsoft (R) C/C++ Optimizing Compiler Version 19.12.25830.2 for x64.

I get execution times of around 30 seconds for int and 100 seconds for the other two.
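
As a sketch only (not part of the measurements above; sort_noinline is a made-up name for illustration), the inlining difference can be taken out of the comparison by pinning the function out of line in all three variants. __declspec(noinline) is the MSVC spelling; GCC and Clang use __attribute__((noinline)):

// Keep the function out of line in every variant so that only the code
// generated for the loop differs, not the inlining decision. The body
// matches the question, except that index starts at j, which also avoids
// the "may be used uninitialized" warning mentioned in the comments.
__declspec(noinline)
void sort_noinline(int N, int A[], int WorkArray[])
{
    for (int j = 0; j < N; j++)
    {
        int val_min = 500000;
        int index = j;
        for (int i = j; i < N; i++)
        {
            if (A[i] < val_min) { val_min = A[i]; index = i; }
        }
        WorkArray[j] = A[j];
        A[j] = val_min;
        A[index] = WorkArray[j];
    }
}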

1201ProgramAlarm
  • Thanks! Do you have any clue why it doesn't unroll for `unsigned int` considering it's the same size as `int`? – MajinSaha Mar 16 '18 at 20:09
  • That's the sad thing about closed-source compilers: unless you wrote it (and are willing to violate the confidentiality agreement with your employer), there is no way for anyone else to know why the compiler does what it does with specific optimizations. – David C. Rankin Mar 16 '18 at 20:13
  • Probably because there's an SSE instruction (`pcmpgtd`) to do a signed compare, but not one for an unsigned compare (see the sketch after these comments). – 1201ProgramAlarm Mar 16 '18 at 20:14
  • @1201ProgramAlarm - Interesting. Mine never generates this instruction or anything like it, although it uses other AVX/SSE instructions. – zzxyz Mar 16 '18 at 23:40
  • @1201ProgramAlarm - And VS 2015 and 2017 generate fairly different assembly (and 25s vs. 40s on 2015). – zzxyz Mar 17 '18 at 00:17
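
To illustrate the pcmpgtd point from the comment above, a minimal sketch (the helpers below are hypothetical and are not taken from MSVC's output): SSE2 provides a signed 32-bit compare intrinsic but no unsigned one, so an unsigned compare has to be emulated:

#include <emmintrin.h> // SSE2 intrinsics
#include <climits>     // INT_MIN

// Signed 32-bit "greater than" maps directly onto pcmpgtd.
__m128i cmpgt_i32(__m128i a, __m128i b)
{
    return _mm_cmpgt_epi32(a, b);
}

// SSE2 has no unsigned dword compare. It can be emulated by flipping the
// sign bit of both operands (which maps unsigned order onto signed order)
// and then using the signed compare. The extra instructions are one
// plausible reason a vectorizer treats signed and unsigned loops differently.
__m128i cmpgt_u32(__m128i a, __m128i b)
{
    const __m128i sign = _mm_set1_epi32(INT_MIN);
    return _mm_cmpgt_epi32(_mm_xor_si128(a, sign), _mm_xor_si128(b, sign));
}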
6

I tried this code in VS2017 and was able to reproduce the behavior.

I modified the code as follows so that the times become almost the same.

The cause seems to be the implicit casting of the array index.

#include <ctime>
#include <vector>
#include <iostream>

using namespace std;

// exchange sort
template<typename elem_t, typename index_t>
void sort(index_t size, elem_t* a, elem_t* b)
{
    index_t index = 0, i, j;
    elem_t min;

    for (j = 0; j < size; j++)
    {
        min = 500000;
        for (i = j; i < size; i++)
        {
            if (a[i] < min)
            {
                min = a[i];
                index = i;
            }
        }
        b[j] = a[j];
        a[j] = min;
        a[index] = b[j];
    }
}

template<typename elem_t, typename index_t, index_t size>
void test() {
    //vector<elem_t> a(size);
    //vector<elem_t> b(size);

    elem_t a[size];
    elem_t b[size];

    for (index_t k = 0; k < size; k++)
        a[k] = (elem_t)(size - (k + 1));

    clock_t begin = clock();
    sort(size, &a[0], &b[0]);
    clock_t end = clock();

    double sortTime = double(end - begin) / CLOCKS_PER_SEC;
    cout << "Sort time: " << sortTime << endl;
}

int main()
{
    const int size = 40000;

    cout << "<size_t, int>" << endl;
    test<size_t, int, size>();
    cout << endl;

    cout << "<size_t, size_t>" << endl;
    test<size_t, size_t, size>();
    cout << endl;

    cout << "<int, int>" << endl;
    test<int, int, size>();
    cout << endl;

    cout << "<int, size_t>" << endl;
    test<int, size_t, size>();
    cout << endl;

    cout << "<uint, int>" << endl;
    test<unsigned int, int, size>();
    cout << endl;

    cout << "<uint, size_t>" << endl;
    test<unsigned int, size_t, size>();
    cout << endl;

    cout << "<uint, uint>" << endl;
    test<unsigned int, unsigned int, size>();
    cout << endl;
}

Personally, I do not like implicit casting. For troubleshooting this kind of problem, increase the warning level to the maximum, resolve all warnings, and then convert the code to generic (templated) code as above. This will help you identify the problem (see also the sketch below).

Running this code prints the timing for each of the element/index type combinations.
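
For illustration only (a made-up example, not part of the test program above), this is the kind of implicit conversion being referred to: with a signed int index and size_t elements, every subscript first widens the index, and writing the conversion out makes it visible in the source:

#include <cstddef>

// Hypothetical illustration: a[i] with a signed index implicitly
// sign-extends i to the pointer-width type before the address calculation.
// Spelling out the conversion keeps the index arithmetic in one type and
// makes the conversion visible instead of silent.
std::size_t read_at(const std::size_t* a, int i)
{
    const std::size_t idx = static_cast<std::size_t>(i); // was implicit in a[i]
    return a[idx];
}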

Minos
  • The phrase "Almost always auto" just keeps becoming more and more relevant – BTownTKD Mar 16 '18 at 18:19
  • How did you manage to reduce run times to 0.67 sec in all three cases!!!!!??????? But thanks for your great answer. I'll start digesting it slowly. – MajinSaha Mar 16 '18 at 18:20
  • @MajinSaha Decreased test size from `400000` to `40000`. – Algirdas Preidžius Mar 16 '18 at 18:21
  • Oh, I see. Sorry, I didn't notice. – MajinSaha Mar 16 '18 at 18:22
  • Re: "I do not like implicit casting" -- there is no such thing. A cast is something you write in your source code to tell the compiler to do a conversion. That is, it's always explicit. **Conversions** can be implicit. – Pete Becker Mar 16 '18 at 18:24
  • So in your case I now have around 67 seconds for all three. Despite the fact you managed to reach equal runtimes for all three cases, this doesn't explain why it took me 22 seconds for bare int alone. Note that I was using int in ALL of the cases for array indices. – MajinSaha Mar 16 '18 at 18:27
  • @Minos, I tested your example, and it seems to behave exactly, as it did for OP (that is: nothing seems to be fixed): `size_t` test case takes `97.936` seconds, `int` test case `27.149`, and `unsigned int` takes `94.835` seconds. Before you ask: yes, I am building in x64 Release mode with full program optimization enabled. – Algirdas Preidžius Mar 16 '18 at 18:27
  • In fact, that was not the right word choice for this code. However, I came up with this solution to show how the machine code can change in ways that are invisible in the source. – Minos Mar 16 '18 at 18:28