
I have a row-major iterator to a 2D array, whose dereference operator is as follows:

int& Iterator::operator*(){ return matrix_[y_][x_]; }  //matrix_ has type int**

The (prefix) increment operator is as follows:

Iterator& Iterator::operator++()
{
    if((++x_ == xs_) && (++y_ != ys_)) //ys_, xs_ are the dimensions of the matrix
        x_ = 0;
    return *this;   
} 
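
For reference, a minimal sketch of the iterator state these two operators assume (member names as in the snippets; the end-of-iteration convention and the inequality test shown here are inferred, not taken from the original code):

class Iterator
{
public:
    // Sketch only: end() is assumed to be (x_ == xs_, y_ == ys_), which is
    // the state operator++ leaves the iterator in after the last element.
    bool operator!=(const Iterator& other) const
    { return x_ != other.x_ || y_ != other.y_; }

private:
    int** matrix_;   // array of row pointers
    int   x_, y_;    // current column and row
    int   xs_, ys_;  // matrix dimensions
};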

I can use this iterator with an optimised version of std::transform (it doesn't return an unneeded result, in order to save a few instructions):

template < class InputIterator, class OutputIterator, class UnaryOperator >
inline void MyTransform( InputIterator first1, InputIterator last1,
                         OutputIterator result, UnaryOperator op )
{
    for (; first1 != last1; ++first1, ++result)
        *result = op(*first1);
} 

calling it thus:

MyTransform(matrix1.begin(), matrix1.end(), matrix2.begin(), MyFunctor());

However, when I compare the performance to a classic, nested for-loop:

MyFunctor f;
for (int y=0; y<ySize; y++)
    for (int x=0; x<xSize; x++)
        matrix2[y][x] = f(matrix1[y][x]);

The iterator-based solution is approx. 25% slower than the nested for-loop solution. This is the case for both MSVC and Intel C++ compilers (both of which seem to auto-inline as needed).

Now the problem does not seem to be with the iterator increment operator: if I use the following (ugly) hybrid solution, combining iterator traversal and raw array access (the latter indexed using the iterators' internal counts):

MyFunctor f;
for (; mat1Begin != mat1End; ++mat1Begin, ++mat2Begin)
{ 
    //mat1 and mat2 are type int**
    mat2[mat2Begin.y_][mat2Begin.x_] = f(mat1[mat1Begin.y_][mat1Begin.x_]);
}

it is actually a little faster than the nested for-loop solution. This suggests to me that the performance hit is in the iterators' dereferencing when doing the assignment.

My question is, why does dereferencing the iterators in the assignment

*result = op(*first1);

incur such a massive performance hit, relative to raw array access? Is there any technique I can use for this simple design, to get performance (almost) equivalent to the raw array version?

In response to helpful feedback from this community, I have modified the code so that the current row pointer (matrix_[y_]) is cached in the iterator, so the code now looks as follows:

int& Iterator::operator*()
{
    return column_[x_]; 
} 

Iterator& Iterator::operator++()
{
    if(++x_ == xs_)      //ys_, xs_ are the dimensions of the matrix
    {
        if(++y_ != ys_)
        { 
            x_ = 0;
            column_ = matrix_[y_];
        }
    }
    return *this;
} 
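
One detail not shown above: column_ must be seeded when the iterator is constructed. A minimal sketch, assuming a constructor of this shape:

Iterator::Iterator(int** matrix, int x, int y, int xs, int ys)
    : matrix_(matrix), x_(x), y_(y), xs_(xs), ys_(ys),
      column_(y != ys ? matrix[y] : 0)  // cache the current row pointer up front
{}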

This improves performance to ~85% of the raw 2D array performance for the Intel C++ compiler, and similarly for the MSVC compiler (actually the call to MyTransform is slower on MSVC - many more assembly instructions are generated - but let's ignore that for now, as I am more interested in the loop/dereference behaviour).

When I convert the code to use pointer arithmetic (again caching the column) as follows, the performance is significantly worse than the raw 2D array (~70%) on the Intel compiler, but again ~85% of the raw 2D array under the MSVC compiler:

int& Image32Iterator::operator*()
{
    return *ptr_;
} 

//prefix
Image32Iterator& Image32Iterator::operator++()
{
    if(++ptr_ == ptrEnd_)
    {
        if(++row_ != rowEnd_)
        { 
            ptrEnd_ = (ptr_ = *row_) + xs_;
        }
    }
    return *this;
} 
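
For reference, the member state this version assumes (names are from the snippets; the comments are mine):

int*  ptr_;     // current element within the current row
int*  ptrEnd_;  // one past the end of the current row
int** row_;     // current entry in the array of row pointers
int** rowEnd_;  // one past the last row pointer
int   xs_;      // row width, used to recompute ptrEnd_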

So I am trying to understand whether ~85% is the maximum performance I can get from the iterator-based solution. I am surprised that the pointer arithmetic solution performs so much worse (galling, as I was trying to use pointer arithmetic to get above 85%!).

I'll continue investigating and update with findings, but any insight is welcome...

...So, focussing on the issue of why the pointer-arithmetic version of the iterator performs so badly for Intel, whereas it performs okay for the MSVC compiler, I have looked at the assembly, and the problem seems to be in the code generated for the loop. For all the other functions (the constructors, the increment and dereference operators, the inequality operator, etc.) the generated code is pretty much the same for both Intel and MSVC; if anything it is slightly more concise for Intel.

Here is the assembler for the Intel-generated code, followed by the assembler for the MSVC-generated code. I have changed the for loop to a while loop to make the generated assembler easier to read.
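
For reference, the loop being compiled is as follows (reconstructed from the source lines interleaved in the listings below; begin and end are Image32Iterators, it8 is an Image8Iterator, and gsf is a GreyScaleFunctor):

while(begin != end)
{
    *it8 = gsf(*begin);
    ++begin;
    ++it8;
}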

Intel generated code:

while(begin != end)
01392D31  push        eax  
01392D32  lea         eax,[begin]  
01392D35  lea         edx,[end]  
01392D38  mov         dword ptr [esp],edx  
01392D3B  mov         ecx,eax  
01392D3D  call        ImageExperiments::Image32Iterator::operator!= (139103Ch)  
01392D42  mov         byte ptr [ebp-74h],al  
01392D45  movzx       eax,byte ptr [ebp-74h]  
01392D49  movzx       eax,al  
01392D4C  test        eax,eax  
01392D4E  je          ImageExperiments::greyscale_iterator2+0BCh (1392DACh)  
{
    *it8 = gsf(*begin);
01392D50  lea         eax,[begin]  
01392D53  mov         ecx,eax  
01392D55  call        ImageExperiments::Image32Iterator::operator* (13910A5h)  
01392D5A  mov         dword ptr [ebp-10h],eax  
01392D5D  push        eax  
01392D5E  lea         eax,[gsf]  
01392D61  mov         edx,dword ptr [ebp-10h]  
01392D64  mov         edx,dword ptr [edx]  
01392D66  mov         dword ptr [esp],edx  
01392D69  mov         ecx,eax  
01392D6B  call        ImageExperiments::GreyScaleFunctor::operator() (139101Eh)  
01392D70  mov         byte ptr [ebp-72h],al  
01392D73  movzx       eax,byte ptr [ebp-72h]  
01392D77  mov         byte ptr [ebp-71h],al  
01392D7A  lea         eax,[it8]  
01392D7D  mov         ecx,eax  
01392D7F  call        ImageExperiments::Image8Iterator::operator* (1391050h)  
01392D84  mov         dword ptr [ebp-0Ch],eax  
01392D87  mov         eax,dword ptr [ebp-0Ch]  
01392D8A  movzx       edx,byte ptr [ebp-71h]  
01392D8E  mov         byte ptr [eax],dl  
    ++begin; 
01392D90  lea         eax,[begin]  
01392D93  mov         ecx,eax  
01392D95  call        ImageExperiments::Image32Iterator::operator++ (1391028h)  
01392D9A  mov         dword ptr [ebp-8],eax  
    ++it8;
01392D9D  lea         eax,[it8]  
01392DA0  mov         ecx,eax  
01392DA2  call        ImageExperiments::Image8Iterator::operator++ (1391014h)  
01392DA7  mov         dword ptr [ebp-4],eax  
01392DAA  jmp         ImageExperiments::greyscale_iterator2+41h (1392D31h)  
}
}
00CA2DAC  leave  
00CA2DAD  ret

MSVC generated code:

while(begin != end)
010316E0  lea         eax,[end]  
010316E3  push        eax  
010316E4  lea         ecx,[begin]  
010316E7  call        ImageExperiments::Image32Iterator::operator!= (1031096h)  
010316EC  movzx       ecx,al  
010316EF  test        ecx,ecx  
010316F1  je          ImageExperiments::greyscale_iterator2+74h (1031724h)  
{
    *it8 = gsf(*begin);
010316F3  lea         ecx,[begin]  
010316F6  call        ImageExperiments::Image32Iterator::operator* (10311EAh)  
010316FB  mov         eax,dword ptr [eax]  
010316FD  push        eax  
010316FE  lea         ecx,[gsf]  
01031701  call        ImageExperiments::GreyScaleFunctor::operator() (1031032h)  
01031706  mov         bl,al  
01031708  lea         ecx,[it8]  
0103170B  call        ImageExperiments::Image8Iterator::operator* (1031118h)  
01031710  mov         byte ptr [eax],bl  
    ++begin; 
01031712  lea         ecx,[begin]  
01031715  call        ImageExperiments::Image32Iterator::operator++ (1031041h)  
    ++it8;
0103171A  lea         ecx,[it8]  
0103171D  call        ImageExperiments::Image8Iterator::operator++ (103101Eh)  
}
01031722  jmp         ImageExperiments::greyscale_iterator2+30h (10316E0h)  
}
01031724  pop         edi  
01031725  pop         esi  
01031726  pop         ebx  
01031727  mov         esp,ebp  
01031729  pop         ebp  
0103172A  ret  

So it looks to me like the Intel compiler generates roughly 50% more instructions. I have tried qualifying the pointers with __restrict to see if that would make any difference to the Intel code generation, but it did not. If anyone has any suggestion as to why the loop code from the Intel compiler is so bulky/slow compared to the MSVC++ compiler, I'd be very interested!

TPJ
  • I imagine it would be worth comparing the assembler in both cases here. – Alex Wilson Jun 29 '12 at 16:29
  • Also, in your second test with iterators, have you tried replacing this line mat2[mat2Begin.y_][mat2Begin.x_] = f(mat1[mat1Begin.y_][mat1Begin.x_]); with the iterator-based assignment, just to ensure that even after tidying away all the MyTransform infrastructure, the performance is still degraded with the dereferencing? – Alex Wilson Jun 29 '12 at 16:31
  • Yes, I have tried that, it is nothing to do with the transform machinery. – TPJ Jun 29 '12 at 17:04
  • Thanks for the suggestion to look at the assembler; I had started doing that but I'm not so literate at that level and so it is hard going :) I am baffled as to why there is such a hit. I much prefer using iterators, but this is for a performance-sensitive bit of code which can't tolerate such a hit. – TPJ Jun 29 '12 at 17:10
  • Though it goes against basic design principles, calling a function in a tight loop when performance is paramount is a bad idea. Check your assembler to see what it's actually doing before changing anything, but if that function call isn't optimized away you may want to remove it completely and measure again. In a tight loop this stuff actually matters, a lot. – Ed S. Jun 29 '12 at 17:13
  • The function call is inlined. – TPJ Jun 29 '12 at 18:26
  • I can't re-create this behaviour in g++ based on your description. I have coded something similar to your description in order to have a look at the assembler [here](http://ideone.com/bN6ln), but I see the timings for the iterators dominated by the increment logic, and no change in timing from using operator*. Perhaps I've implemented it incorrectly. – Alex Wilson Jun 29 '12 at 20:11
  • Edited my post; some bad science had crept in. In fact the MSVC-compiled version of the iterator solution performs approx. the same in the iterator incrementing/dereferencing as the Intel C++ compiled version, so my questions cover both compilers. – TPJ Jul 03 '12 at 08:14

4 Answers


I've had a go at recreating your code; see here.

Running it under g++ (4.6.3, -O3), I find that:

1) The non-iterator version is indeed faster, but in my case by about a factor of 4.

2) The iterator version, whether you dereference the iterators or extract their counters and use them to access the array directly, is slower (by the aforementioned factor).

I've captured the assembler for the two versions and find them to be quite different, with a significant amount of code associated with the iterator-incrementing logic in version 2. Note that everything is inlined in both cases.

Case 1 inner loop, no iterators:

.L18:
    xorl    %eax, %eax
    .p2align 4,,10
    .p2align 3
.L19:
    movq    24(%rsp), %rdx
    movq    40(%rsp), %rsi
    movq    (%rdx,%rcx), %rdx
    movq    (%rsi,%rcx), %rsi
    movl    (%rdx,%rax), %edx
    imull   %edx, %edx
    movl    %edx, (%rsi,%rax)
    addq    $4, %rax
    cmpq    $20000, %rax
    jne .L19
    addq    $8, %rcx
    cmpq    $40000, %rcx
    jne .L18
    movl    $.LC2, %esi
    movl    std::cout, %edi

Case 2 inner loop, iterators:

.L34:
    movl    %eax, 56(%rsp)
    movl    %ecx, 60(%rsp)
    movl    %edi, 72(%rsp)
    movl    %edi, 76(%rsp)
    movq    72(%rsp), %rdi
    cmpq    %rdi, 56(%rsp)
    je  .L36
    movq    24(%rsp), %rdi
    movslq  %eax, %r10
    movslq  %ecx, %r9
    movslq  %edx, %r11
    addl    $1, %eax
    movq    (%rdi,%r10,8), %rdi
    movslq  %esi, %r10
    movl    (%rdi,%r9,4), %edi
    movq    40(%rsp), %r9
    imull   %edi, %edi
    movq    (%r9,%r11,8), %r9
    movl    %edi, (%r9,%r10,4)
    movl    16(%rsp), %edi
    cmpl    %edi, %eax
    je  .L37
.L20:
    addl    $1, %edx
    cmpl    32(%rsp), %edx
    jne .L34
    addl    $1, %esi
    cmpl    %esi, %edx
    cmovne  %r8d, %edx
    jmp .L34
.L37:
    addl    $1, %ecx
    cmpl    %ecx, %eax
    cmovne  %r8d, %eax
    jmp .L20
.L36:

Ultimately, I think the best advice, if you like the iterator pattern, is to redefine your matrix's internal array as an int*, allowing the iterator to be a simple wrapper around a pointer. This is obviously at the expense of random-access indexing into the matrix, which will now need to calculate the 1-d offset into the int array given x, y and the row width (although that's hardly rocket science!).
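
To illustrate, a minimal sketch of that layout, assuming the matrix is free to own a single contiguous buffer (the class and member names here are hypothetical):

#include <vector>

class FlatMatrix
{
public:
    FlatMatrix(int xs, int ys) : xs_(xs), data_(xs * ys) {}

    // The iterators degenerate to plain pointers over the contiguous buffer.
    int* begin() { return &data_[0]; }
    int* end()   { return &data_[0] + data_.size(); }

    // Random access now pays one multiply to form the 1-d offset.
    int& at(int x, int y) { return data_[y * xs_ + x]; }

private:
    int xs_;                 // row width
    std::vector<int> data_;  // single contiguous allocation
};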

Alex Wilson
  • Thanks, Alex, I've added latest findings to my main post. I'm continuing investigations. – TPJ Jul 03 '12 at 07:39
  • No problem. I still recommend looking at the assembly. Here's an SO post on how to enable assembly output in Visual Studio: http://stackoverflow.com/questions/1020498/how-to-view-the-assembly-behind-the-code-msvc-if-relevent – Alex Wilson Jul 03 '12 at 08:10
  • Thanks, I am looking at the assembler in the dissassembly window. – TPJ Jul 03 '12 at 08:26

I think your iterator is too big. When you call operator*(), the worst case is that your compiler needs to fetch y_ and x_ before it can fetch the value of matrix_ at (x_, y_). I would try to use raw pointers as iterators when possible. That means that when matrix_ is defined as int matrix_[N][M], you can use &matrix_[0][0] as begin and &matrix_[N-1][M] as end for iteration. And of course, there's always valarray.
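
To illustrate, a sketch of that suggestion with a statically sized matrix (the dimensions and the functor here are placeholders):

#include <algorithm>

const int N = 4, M = 5;
int src[N][M], dst[N][M];

struct Square { int operator()(int v) const { return v * v; } };

void run()
{
    // &src[0][0] .. &src[N-1][M] spans all N*M contiguous ints.
    std::transform(&src[0][0], &src[N - 1][M], &dst[0][0], Square());
}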

znkr
  • Thanks, I think your suggestion is pointing a similar way to ecatmur's, I can explore this and report back with what I find. Much appreciated. – TPJ Jun 29 '12 at 18:37
  • As per my response to ecatmur, I changed the code to cache matrix_[y] and the performance is now at ~85% of the raw 2D array approach with the Intel compiler, i.e., a welcome improvement. However, the performance is actually worse caching this way using the MSVC compiler, at around 60%, which is strange. – TPJ Jul 03 '12 at 07:38
  • @ThomasJordan: Could be an aliasing problem. Check out `restrict`. – znkr Jul 03 '12 at 07:54
  • Edited my post to remove comments saying MSVC compiler performs worse. There was some bad science crept in. In fact the MSVC compiled version of the iterator solution performs approx. same in the iterator incrementing/dereferencing as the Intel C++ compiler, at about 85%, when I cache matrix_[y]. – TPJ Jul 03 '12 at 08:23
  • I qualified all the member variables with __restrict and enabled its use, but it did not make any difference. – TPJ Jul 03 '12 at 14:03

1. Memory localization. Guaranteed contiguity. I noticed you clarified that mat1 and mat2 are int**, but how is matrix_ laid out in memory? Iterators can conceivably point anywhere. Is the memory for matrix_ localized? Heap-based multidimensional arrays may not be contiguous, but vector<> is.

This line of code isn't using the actual iterators, but using their variables to index an array that is localized:

mat2[mat2Begin.y_][mat2Begin.x_] = f(mat1[mat1Begin.y_][mat1Begin.x_]);

2. You're forgetting about optimizations. In the second example using the increment operator, you're doing that step before calling the functor.

This may mean that calling the functor with an object that is dereferenced through the operator is interfering with the optimizer's ability to order operations.

Try storing off the dereferenced object before calling op() on it, and see if that eliminates the cost.

for (; first1 != last1; ++first1, ++result)
{
    typename InputIterator::value_type val = *first1;
    *result = op(val);
}

I've seen some funky things happen when operators are used in argument expressions, to the point where resolution is deferred until after the call (some other interpretation of the expression is sent in, and the expression is resolved after the call), and argument evaluation order is not guaranteed. If you have efficiency problems, it's best to pass the actual intended object as the argument.

Lee Louviere
  • Well, your suggestion should in that case be InputIterator::value_type val = *first1, but this doesn't help (I've just tried it). Thanks for the suggestion anyway! – TPJ Jun 29 '12 at 17:00

You're hoisting the double indirection matrix_[y_][x_] out of a function call and into the loop. Possibly the compiler is managing to cache the pointer matrix_[y_] in one case but not the other; could you try caching matrix_[y_] in the iterator?

ecatmur
  • @ThomasJordan This is the only other likely scenario. – Lee Louviere Jun 29 '12 at 17:07
  • I changed the code to cache matrix_[y] and the performance is now at ~85% of the raw 2D array approach with both Intel and MSVC compiler, i.e., a welcome improvement. – TPJ Jul 03 '12 at 08:24