4

I have an C++11 application where I commonly iterate over several different structure of arrays for various algorithms. Raw CPU performance is important for this app.

The array elements are fundamental types (int, double, ..) or simple struct. The array are typically tens of thousands of elements long. I often need to iterate several arrays at once in a given loop. So typically I would need one pointer for each array of whatever type. So times I need to increment five individual pointers which is verbose.

Based on these answers about tuples, Why is std::pair faster than std::tuple C++11 tuple performance I hoped there was no overhead to using tuples to pack the pointers together into a single object.

I thought it might be nice to implement a cursor like object to assist in iterating, since missing the increment on a particular pointer would be an annoying bug.

auto pts = std::make_tuple(p1, p2, p3...);

allow you to bundle a bunch of variables together in a typesafe way. Then you can implement a variadic template function to increment each pointer in the tuple in a type safe way.

However...

When I measure performance, the tuple version was slower then using raw pointers. But when I look at the generated assembly I see additional mov instructions in the tuple loop increment. Maybe due to the fact the std::get<> returns a reference? I had hoped that would be compiled away...

Am I missing something or are raw pointers just going to beat tuples when used like this? Here is a simple test harness. I threw away the fancy cursor code and just use a std::tuple<> for this test

On my machine, the tuple loop is consistently twice as slow as the raw pointer version for various data sizes.

My system config is Visual C++ 2013 x64 on Windows 8 with a release build. I did try turning on various optimization in Visual Studio such as Inline Function Expansion : Any Suitable (/Ob2) but it did not seem to change the time result for my case.

I did need to do two extra things to avoid aggressive optimization by VS
1) I forced the test data array to allocated on the heap, not the stack. That made a big difference when I timed things, possibly due to memory cache effects.
2) I forced a side effect by writing to static variable at the end so the compiler would not just skip my loop.

struct forceHeap
{
    __declspec(noinline) int* newData(int M)
    {
        int* data = new int[M];
        return data;
    }    
};

void timeSumCursor()
{
    static int gIntStore; 
    int maxCount = 20;
    int M = 10000000; 
    // compiler might place array on stack which changes the timing
    // int* data = new int[N];         
    forceHeap fh;
    int* data = fh.newData(M);
    int *front = data; 
    int *end = data + M;

    int j = 0;
    for (int* p = front; p < end; ++p)
    {
        *p = (++j) % 1000;
    }

    {
        BEGIN_TIMING_BLOCK("raw pointer loop", maxCount);
        int* p = front;
        int sum = 0;
        int* cursor = front;
        while (++cursor != end)
        {
            sum += *cursor;
        }
        gIntStore = sum;// force a side effect
        END_TIMING_BLOCK(); 
    }
    printf("%d\n", gIntStore);

    {
        // just use a simple tuple to show the issue 
        // rather full blown cursor object
        BEGIN_TIMING_BLOCK("tuple loop", maxCount);
        int sum = 0;
        auto cursor = std::make_tuple(front);
        while (++std::get<0>(cursor) != end)
        {
            sum += *std::get<0>(cursor);
        }
        gIntStore = sum; // force a side effect
        END_TIMING_BLOCK();
    }
    printf("%d\n", gIntStore);

    delete[] data;
}
Community
  • 1
  • 1
meissnersd
  • 1,272
  • 2
  • 12
  • 21
  • First issue: always time separately, not one after another. `maxCount` does not seem to be used? One of the `int* p = front;` does not seem to be used? – Yakk - Adam Nevraumont Jan 28 '16 at 21:46
  • This is a snippet taken from a test harness we use. BEGIN_TIMING_BLOCK(label, passes) is a macro which loops over the block passes number of times. This helps average out noise from the system. So maxCount sets the number of passes. When I did my test, I tried many variations in different orders. On my system, changing the order of the test had no effect on the result. I left out the variations and just posted the snippet you see for clarity. – meissnersd Jan 29 '16 at 00:32
  • And yes ```int* p = front;``` is cruft that should have been deleted. Doing so does not affect the result. – meissnersd Jan 29 '16 at 00:33
  • 2
    So an issue is that it is harder for us to reproduce your problem when the code posted is incomplete and/or contains redundant stuff. If you post a complete example that can be copy/pasted and reproduce your problem, you'll get better results from a SO question. Good luck! – Yakk - Adam Nevraumont Jan 29 '16 at 01:59

0 Answers0