Performance difference with custom iterator

Question

I was trying to implement an iterator over the pixels of a 3D image. The image is stored in a single vector, and I do index mapping to go from (x, y, z) to the vector index. The idea is to be able to do a range-based for on the image.

Of course, iterating over the pixels by using the vector iterators is really fast (0.6s in my case), but I don't have access to the indexes.

A triple for loop (z, y, x to be in the memory order) is as fast on VS2015, but way slower (2.4s) with gcc or clang (gcc 5.2, clang 3.9 as well as older versions).

I wrote an iterator that returns a tuple (x, y, z, value), and it's also around 2.4s, even with VS2015. The iterator contains a reference to the image, and the three field x, y, and z, and the value of the pixel is obtained by using the same pixel_at(x, y, z) as the triple loop.

So, I have 2 questions:

Why is VS2015 so much faster than GCC and Clang? (answered in edit)
How can I achieve the faster time while still having access to x, y, z?

Here's the increment of the x/y/z iterator:

_x++;
if (_x == _im.width())
{
    _x = 0;
    _y++;
    if (_y == _im.height())
    {
        _y = 0;
        _z++;
    }
}

The loops that are measured are the following:

Triple for-loop:

for (int k = 0; k < im.depth(); ++k)
    for (int j = 0; j < im.height(); ++j)
        for (int i = 0; i < im.width(); ++i)
            sum += im.pixel_at(i, j, k);

Iterate over the vector:

for (unsigned char c : im.pixels())
    sum += c;

Iterate with the custom iterator:

for (const auto& c : im.indexes())
    sum += std::get<3>(c);

The iterator implemented with a single index gives me the good performance (0.6s), but then I don't have access to x, y, z, and computing them on the fly is too slow (6s). Having x, y, z and the index in the iterator doesn't help either (2.4s).

The benchmark is 100 passes on the image, while summing the values of the pixels, randomly initialized. For the cases where I have the indexes too, I made sure that the value of the indexes are read (for instance, the sum of the z coordinates) so that the compiler doesn't optimize them. It doesn't change anything, except for the case where I compute the indexes on the fly, where the computation was optimized away.

The whole code can be found here: https://github.com/nitnelave/3DIterator

Thanks!

EDIT: After enabling link-time optimization (-flto), the triple for loop is as fast (0.19s) on g++ as the vector iterators (0.18s), but the custom iterator is still 6 times slower (1.2s). That's already much better than the 3.2s I was getting for both the for loop and the iterator, but we're not quite there yet.

EDIT: I isolated the loops inside functions which I forbade gcc to inline, to isolate the assembly code, and here it is:

Triple for loop:

push   %r15
push   %r14
push   %r13
push   %r12
push   %rbp
push   %rdi
push   %rsi
push   %rbx
sub    $0x78,%rsp
mov    0x8(%rcx),%eax
pxor   %xmm4,%xmm4
mov    %rcx,%r12
movl   $0x64,0x5c(%rsp)

xor    %r13d,%r13d
xor    %r14d,%r14d
mov    %eax,0x58(%rsp)
mov    0x58(%rsp),%edx
test   %edx,%edx
jle    10040163a <_Z17bench_triple_loopRK7Image3D.constprop.5+0x2fa>
mov    0x4(%r12),%eax
movl   $0x0,0x54(%rsp)

movl   $0x0,0x2c(%rsp)

mov    %eax,0x48(%rsp)
nopl   0x0(%rax,%rax,1)
nopw   %cs:0x0(%rax,%rax,1)

mov    0x48(%rsp),%eax
test   %eax,%eax
jle    10040161f <_Z17bench_triple_loopRK7Image3D.constprop.5+0x2df>
movslq (%r12),%rax
mov    0x54(%rsp),%r10d
mov    0x2c(%rsp),%ebx
mov    %rax,%rcx
mov    %rax,0x40(%rsp)
add    %rax,%rax
mov    %rax,0x38(%rsp)
lea    -0x1(%rcx),%eax
imul   %ecx,%r10d
imul   %eax,%ebx
mov    %eax,0x50(%rsp)
movslq %r10d,%rsi
lea    (%rsi,%rsi,1),%r9
mov    %ebx,0x4c(%rsp)
xor    %ebx,%ebx
xchg   %ax,%ax
nopw   %cs:0x0(%rax,%rax,1)

test   %ecx,%ecx
jle    100401605 <_Z17bench_triple_loopRK7Image3D.constprop.5+0x2c5>
mov    0x10(%r12),%rdx
lea    (%rdx,%r9,1),%r8
mov    %r8,%rax
and    $0xf,%eax
shr    %rax
neg    %rax
and    $0x7,%eax
cmp    %ecx,%eax
cmova  %ecx,%eax
cmp    $0xa,%ecx
cmovbe %ecx,%eax
test   %eax,%eax
je     100401690 <_Z17bench_triple_loopRK7Image3D.constprop.5+0x350>
movswl (%r8),%r8d
add    %r8d,%r14d
cmp    $0x1,%eax
je     100401720 <_Z17bench_triple_loopRK7Image3D.constprop.5+0x3e0>
movswl 0x2(%rdx,%r9,1),%r8d
add    %r8d,%r14d
cmp    $0x2,%eax
je     100401710 <_Z17bench_triple_loopRK7Image3D.constprop.5+0x3d0>
movswl 0x4(%rdx,%r9,1),%r8d
add    %r8d,%r14d
cmp    $0x3,%eax
je     100401700 <_Z17bench_triple_loopRK7Image3D.constprop.5+0x3c0>
movswl 0x6(%rdx,%r9,1),%r8d
add    %r8d,%r14d
cmp    $0x4,%eax
je     1004016f0 <_Z17bench_triple_loopRK7Image3D.constprop.5+0x3b0>
movswl 0x8(%rdx,%r9,1),%r8d
add    %r8d,%r14d
cmp    $0x5,%eax
je     1004016e0 <_Z17bench_triple_loopRK7Image3D.constprop.5+0x3a0>
movswl 0xa(%rdx,%r9,1),%r8d
add    %r8d,%r14d
cmp    $0x6,%eax
je     1004016d0 <_Z17bench_triple_loopRK7Image3D.constprop.5+0x390>
movswl 0xc(%rdx,%r9,1),%r8d
add    %r8d,%r14d
cmp    $0x7,%eax
je     1004016c0 <_Z17bench_triple_loopRK7Image3D.constprop.5+0x380>
movswl 0xe(%rdx,%r9,1),%r8d
add    %r8d,%r14d
cmp    $0x8,%eax
je     1004016b0 <_Z17bench_triple_loopRK7Image3D.constprop.5+0x370>
movswl 0x10(%rdx,%r9,1),%r8d
add    %r8d,%r14d
cmp    $0xa,%eax
jne    1004016a0 <_Z17bench_triple_loopRK7Image3D.constprop.5+0x360>
movswl 0x12(%rdx,%r9,1),%r8d
add    %r8d,%r14d
mov    $0xa,%r8d
cmp    %ecx,%eax
je     1004015fb <_Z17bench_triple_loopRK7Image3D.constprop.5+0x2bb>
mov    %eax,%r15d
mov    %ecx,%edi
sub    %eax,%edi
mov    %r15,0x30(%rsp)
mov    0x50(%rsp),%r15d
lea    -0x8(%rdi),%r11d
sub    %eax,%r15d
shr    $0x3,%r11d
add    $0x1,%r11d
cmp    $0x6,%r15d
lea    0x0(,%r11,8),%ebp

jbe    100401573 <_Z17bench_triple_loopRK7Image3D.constprop.5+0x233>
mov    0x30(%rsp),%r15
pxor   %xmm0,%xmm0
xor    %eax,%eax
add    %rsi,%r15
lea    (%rdx,%r15,2),%r15
movdqa (%r15),%xmm1
movdqa %xmm4,%xmm3
add    $0x1,%eax
add    $0x10,%r15
pcmpgtw %xmm1,%xmm3
movdqa %xmm1,%xmm2
cmp    %eax,%r11d
punpcklwd %xmm3,%xmm2
punpckhwd %xmm3,%xmm1
paddd  %xmm2,%xmm0
paddd  %xmm1,%xmm0
ja     10040151a <_Z17bench_triple_loopRK7Image3D.constprop.5+0x1da>
movdqa %xmm0,%xmm1
add    %ebp,%r8d
psrldq $0x8,%xmm1
 paddd  %xmm1,%xmm0
 movdqa %xmm0,%xmm1
 psrldq $0x4,%xmm1
 paddd  %xmm1,%xmm0
 movd   %xmm0,%eax
 add    %eax,%r14d
 cmp    %ebp,%edi
 je     1004015fb <_Z17bench_triple_loopRK7Image3D.constprop.5+0x2bb
 lea    (%r10,%r8,1),%eax
 cltq
 movswl (%rdx,%rax,2),%eax
 add    %eax,%r14d
 lea    0x1(%r8),%eax
 cmp    %eax,%ecx
 jle    1004015fb <_Z17bench_triple_loopRK7Image3D.constprop.5+0x2bb
 add    %r10d,%eax
 cltq
 movswl (%rdx,%rax,2),%eax
 add    %eax,%r14d
 lea    0x2(%r8),%eax
 cmp    %eax,%ecx
 jle    1004015fb <_Z17bench_triple_loopRK7Image3D.constprop.5+0x2bb
 add    %r10d,%eax
 cltq
 movswl (%rdx,%rax,2),%eax
 add    %eax,%r14d
 lea    0x3(%r8),%eax
 cmp    %eax,%ecx
 jle    1004015fb <_Z17bench_triple_loopRK7Image3D.constprop.5+0x2bb
 add    %r10d,%eax
 cltq
 movswl (%rdx,%rax,2),%eax
 add    %eax,%r14d
 lea    0x4(%r8),%eax
 cmp    %eax,%ecx
 jle    1004015fb <_Z17bench_triple_loopRK7Image3D.constprop.5+0x2bb
 add    %r10d,%eax
 cltq
 movswl (%rdx,%rax,2),%eax
 add    %eax,%r14d
 lea    0x5(%r8),%eax
 cmp    %eax,%ecx
 jle    1004015fb <_Z17bench_triple_loopRK7Image3D.constprop.5+0x2bb
 add    %r10d,%eax
 add    $0x6,%r8d
 cltq
 movswl (%rdx,%rax,2),%eax
 add    %eax,%r14d
 cmp    %r8d,%ecx
 jle    1004015fb <_Z17bench_triple_loopRK7Image3D.constprop.5+0x2bb
 add    %r10d,%r8d
 movslq %r8d,%r8
 movswl (%rdx,%r8,2),%eax
 add    %eax,%r14d
 add    0x2c(%rsp),%r13d
 add    0x4c(%rsp),%r13d
 add    $0x1,%ebx
 add    0x38(%rsp),%r9
 add    0x40(%rsp),%rsi
 add    %ecx,%r10d
 cmp    %ebx,0x48(%rsp)
 jne    1004013f0 <_Z17bench_triple_loopRK7Image3D.constprop.5+0xb0>
 addl   $0x1,0x2c(%rsp)
 mov    0x48(%rsp),%ebx
 mov    0x2c(%rsp),%eax
 add    %ebx,0x54(%rsp)
 cmp    0x58(%rsp),%eax
 jne    1004013a0 <_Z17bench_triple_loopRK7Image3D.constprop.5+0x60>
 subl   $0x1,0x5c(%rsp)
 jne    10040136c <_Z17bench_triple_loopRK7Image3D.constprop.5+0x2c>
mov    0x2a64(%rip),%rcx        # 1004040b0 <__fu0__ZSt4cout>
mov    %r13d,%eax
mov    %r14d,%r13d
mov    %eax,%edx
callq  100401828 <_ZNSolsEi>
lea    0x6f(%rsp),%rdx
mov    $0x1,%r8d
mov    %rax,%rcx
movb   $0xa,0x6f(%rsp)
callq  100401800 <_ZSt16__ostream_insertIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_PKS3_l>
mov    %r13d,%eax
add    $0x78,%rsp
pop    %rbx
pop    %rsi
pop    %rdi
pop    %rbp
pop    %r12
pop    %r13
pop    %r14
pop    %r15
retq
nop
nopw   %cs:0x0(%rax,%rax,1)

xor    %r8d,%r8d
jmpq   1004014da <_Z17bench_triple_loopRK7Image3D.constprop.5+0x19a>
nopl   0x0(%rax,%rax,1)

mov    $0x9,%r8d
jmpq   1004014d2 <_Z17bench_triple_loopRK7Image3D.constprop.5+0x192>
nopl   0x0(%rax,%rax,1)
mov    $0x8,%r8d
jmpq   1004014d2 <_Z17bench_triple_loopRK7Image3D.constprop.5+0x192>
nopl   0x0(%rax,%rax,1)
mov    $0x7,%r8d
jmpq   1004014d2 <_Z17bench_triple_loopRK7Image3D.constprop.5+0x192>
nopl   0x0(%rax,%rax,1)
mov    $0x6,%r8d
jmpq   1004014d2 <_Z17bench_triple_loopRK7Image3D.constprop.5+0x192>
nopl   0x0(%rax,%rax,1)
mov    $0x5,%r8d
jmpq   1004014d2 <_Z17bench_triple_loopRK7Image3D.constprop.5+0x192>
nopl   0x0(%rax,%rax,1)
mov    $0x4,%r8d
jmpq   1004014d2 <_Z17bench_triple_loopRK7Image3D.constprop.5+0x192>
nopl   0x0(%rax,%rax,1)
mov    $0x3,%r8d
jmpq   1004014d2 <_Z17bench_triple_loopRK7Image3D.constprop.5+0x192>
nopl   0x0(%rax,%rax,1)
mov    $0x2,%r8d
jmpq   1004014d2 <_Z17bench_triple_loopRK7Image3D.constprop.5+0x192>
nopl   0x0(%rax,%rax,1)
mov    $0x1,%r8d
jmpq   1004014d2 <_Z17bench_triple_loopRK7Image3D.constprop.5+0x192>
nopl   0x0(%rax,%rax,1)

Iteration over the vector:

push   %r14
push   %r13
push   %r12
push   %rbp
push   %rdi
push   %rsi
push   %rbx
mov    0x10(%rcx),%r8
mov    0x18(%rcx),%r10
mov    $0x64,%ebx
pxor   %xmm4,%xmm4
lea    0x2(%r8),%r12
mov    %r10,%rdi
mov    %r8,%rbp
and    $0xf,%ebp
sub    %r12,%rdi
shr    %rbp
shr    %rdi
neg    %rbp
lea    0x1(%rdi),%r11
and    $0x7,%ebp
cmp    %r11,%rbp
cmova  %r11,%rbp
xor    %eax,%eax
xchg   %ax,%ax
nopw   %cs:0x0(%rax,%rax,1)

cmp    %r8,%r10
je     1004012dc <_Z10bench_iterRK7Image3D.constprop.4+0x1fc>
cmp    $0xa,%r11
mov    %r11,%rcx
ja     1004012f0 <_Z10bench_iterRK7Image3D.constprop.4+0x210>
movswl (%r8),%edx
add    %edx,%eax
cmp    $0x1,%rcx
je     100401300 <_Z10bench_iterRK7Image3D.constprop.4+0x220>
movswl 0x2(%r8),%edx
add    %edx,%eax
cmp    $0x2,%rcx
lea    0x4(%r8),%rdx
je     1004011ed <_Z10bench_iterRK7Image3D.constprop.4+0x10d>
movswl 0x4(%r8),%edx
add    %edx,%eax
cmp    $0x3,%rcx
lea    0x6(%r8),%rdx
je     1004011ed <_Z10bench_iterRK7Image3D.constprop.4+0x10d>
movswl 0x6(%r8),%edx
add    %edx,%eax
cmp    $0x4,%rcx
lea    0x8(%r8),%rdx
je     1004011ed <_Z10bench_iterRK7Image3D.constprop.4+0x10d>
movswl 0x8(%r8),%edx
add    %edx,%eax
cmp    $0x5,%rcx
lea    0xa(%r8),%rdx
je     1004011ed <_Z10bench_iterRK7Image3D.constprop.4+0x10d>
movswl 0xa(%r8),%edx
add    %edx,%eax
cmp    $0x6,%rcx
lea    0xc(%r8),%rdx
je     1004011ed <_Z10bench_iterRK7Image3D.constprop.4+0x10d>
movswl 0xc(%r8),%edx
add    %edx,%eax
cmp    $0x7,%rcx
lea    0xe(%r8),%rdx
je     1004011ed <_Z10bench_iterRK7Image3D.constprop.4+0x10d>
movswl 0xe(%r8),%edx
add    %edx,%eax
cmp    $0x8,%rcx
lea    0x10(%r8),%rdx
je     1004011ed <_Z10bench_iterRK7Image3D.constprop.4+0x10d>
movswl 0x10(%r8),%edx
add    %edx,%eax
cmp    $0xa,%rcx
lea    0x12(%r8),%rdx
jne    1004011ed <_Z10bench_iterRK7Image3D.constprop.4+0x10d>
movswl 0x12(%r8),%edx
add    %edx,%eax
lea    0x14(%r8),%rdx
cmp    %rcx,%r11
je     1004012dc <_Z10bench_iterRK7Image3D.constprop.4+0x1fc>
mov    %r11,%rsi
mov    %rdi,%r14
sub    %rcx,%rsi
sub    %rcx,%r14
lea    -0x8(%rsi),%r9
shr    $0x3,%r9
add    $0x1,%r9
cmp    $0x6,%r14
lea    0x0(,%r9,8),%r13

jbe    10040127d <_Z10bench_iterRK7Image3D.constprop.4+0x19d>
pxor   %xmm0,%xmm0
lea    (%r8,%rcx,2),%r14
xor    %ecx,%ecx
movdqa (%r14),%xmm1
add    $0x1,%rcx
movdqa %xmm4,%xmm2
add    $0x10,%r14
movdqa %xmm1,%xmm3
cmp    %rcx,%r9
pcmpgtw %xmm1,%xmm2
punpcklwd %xmm2,%xmm3
punpckhwd %xmm2,%xmm1
paddd  %xmm3,%xmm0
paddd  %xmm1,%xmm0
ja     100401226 <_Z10bench_iterRK7Image3D.constprop.4+0x146>
movdqa %xmm0,%xmm1
lea    (%rdx,%r13,2),%rdx
psrldq $0x8,%xmm1
paddd  %xmm1,%xmm0
movdqa %xmm0,%xmm1
psrldq $0x4,%xmm1
paddd  %xmm1,%xmm0
movd   %xmm0,%ecx
add    %ecx,%eax
cmp    %r13,%rsi
je     1004012dc <_Z10bench_iterRK7Image3D.constprop.4+0x1fc>
movswl (%rdx),%ecx
add    %ecx,%eax
lea    0x2(%rdx),%rcx
cmp    %rcx,%r10
je     1004012dc <_Z10bench_iterRK7Image3D.constprop.4+0x1fc>
movswl 0x2(%rdx),%ecx
add    %ecx,%eax
lea    0x4(%rdx),%rcx
cmp    %rcx,%r10
je     1004012dc <_Z10bench_iterRK7Image3D.constprop.4+0x1fc>
movswl 0x4(%rdx),%ecx
add    %ecx,%eax
lea    0x6(%rdx),%rcx
cmp    %rcx,%r10
je     1004012dc <_Z10bench_iterRK7Image3D.constprop.4+0x1fc>
movswl 0x6(%rdx),%ecx
add    %ecx,%eax
lea    0x8(%rdx),%rcx
cmp    %rcx,%r10
je     1004012dc <_Z10bench_iterRK7Image3D.constprop.4+0x1fc>
movswl 0x8(%rdx),%ecx
add    %ecx,%eax
lea    0xa(%rdx),%rcx
cmp    %rcx,%r10
je     1004012dc <_Z10bench_iterRK7Image3D.constprop.4+0x1fc>
movswl 0xa(%rdx),%ecx
add    %ecx,%eax
lea    0xc(%rdx),%rcx
cmp    %rcx,%r10
je     1004012dc <_Z10bench_iterRK7Image3D.constprop.4+0x1fc>
movswl 0xc(%rdx),%edx
add    %edx,%eax
sub    $0x1,%ebx
jne    100401130 <_Z10bench_iterRK7Image3D.constprop.4+0x50>
pop    %rbx
pop    %rsi
pop    %rdi
pop    %rbp
pop    %r12
pop    %r13
pop    %r14
retq
test   %rbp,%rbp
jne    100401308 <_Z10bench_iterRK7Image3D.constprop.4+0x228>
xor    %ecx,%ecx
mov    %r8,%rdx
jmpq   1004011f6 <_Z10bench_iterRK7Image3D.constprop.4+0x116>
nop
mov    %r12,%rdx
jmpq   1004011ed <_Z10bench_iterRK7Image3D.constprop.4+0x10d>
mov    %rbp,%rcx
jmpq   100401146 <_Z10bench_iterRK7Image3D.constprop.4+0x66>

Custom iterator:

push   %r14
push   %rbp
push   %rdi
push   %rsi
push   %rbx
sub    $0x30,%rsp
mov    (%rcx),%r10d
mov    $0x64,%ebp
xor    %ebx,%ebx
mov    %rcx,%rdi
mov    0x4(%rcx),%ecx
xor    %edx,%edx
mov    0x8(%rdi),%esi
nop
xor    %r9d,%r9d
xor    %r11d,%r11d
xor    %r8d,%r8d
nopl   0x0(%rax)
cmp    %r9d,%esi
je     1004017b0 <_Z16bench_index_iterRK7Image3D.constprop.0+0x80>
mov    %ecx,%eax
mov    0x10(%rdi),%r14
add    %r9d,%edx
imul   %r9d,%eax
add    %r11d,%eax
imul   %r10d,%eax
add    %r8d,%eax
add    $0x1,%r8d
cltq
movswl (%r14,%rax,2),%eax
add    %eax,%ebx
cmp    %r10d,%r8d
jne    100401760 <_Z16bench_index_iterRK7Image3D.constprop.0+0x30>
add    $0x1,%r11d
xor    %r8d,%r8d
cmp    %r11d,%ecx
jne    100401760 <_Z16bench_index_iterRK7Image3D.constprop.0+0x30>
add    $0x1,%r9d
xor    %r11d,%r11d
cmp    %r9d,%esi
jne    100401765 <_Z16bench_index_iterRK7Image3D.constprop.0+0x35>
nopw   %cs:0x0(%rax,%rax,1)

test   %r11d,%r11d
jne    100401765 <_Z16bench_index_iterRK7Image3D.constprop.0+0x35>
test   %r8d,%r8d
jne    100401765 <_Z16bench_index_iterRK7Image3D.constprop.0+0x35>
sub    $0x1,%ebp
jne    100401750 <_Z16bench_index_iterRK7Image3D.constprop.0+0x20>
mov    0x28ea(%rip),%rcx        # 1004040b0 <__fu0__ZSt4cout>
callq  100401828 <_ZNSolsEi>
lea    0x2f(%rsp),%rdx
mov    $0x1,%r8d
mov    %rax,%rcx
movb   $0xa,0x2f(%rsp)
callq  100401800 <_ZSt16__ostream_insertIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_PKS3_l>
mov    %ebx,%eax
add    $0x30,%rsp
pop    %rbx
pop    %rsi
pop    %rdi
pop    %rbp
pop    %r14
retq

So it seems that there is loop unrolling for the first 2 benches, but not for the custom iterator.

EDIT: I also tried using the boost::asio::coroutines, but it's slower than my solution.

You're asking us to explain the time difference between different versions of code that you didn't show us... — Barry, May 17 '16 at 17:18
I didn't want to show too much code, as the shortest I have is spread over 7 files :/ But if you think that's necessary, I can upload it somewhere. EDIT: I added a link to the code. — V Tolmer, May 17 '16 at 17:23
Your entire question concerns different pairs of loops and iterator increments. Shows us those loop constructs (forget the body) and their corresponding iterator increments. It's a few short snippets, no need for entire files. — Cameron, May 17 '16 at 18:00
Why do you think computing `x/y/z` on the fly is too slow? In your example code you do not use `x/y/z` (except for `operator!=` which can easily use the index instead. — Zulan, May 17 '16 at 21:04
Interestingly, `clang` handles the `x/y/z` iterator much better than `gcc` on my system, but is overall slightly slower on the optimized loops. `clang: 0.43, 0.15, 0.15`, `gcc: 0.12, 0.11, 0.47`. — Zulan, May 17 '16 at 21:07
@Zulan The idea is to have x, y, z to run some algorithm that uses the index. This is a generic method, so just the index is not enough. — V Tolmer, May 18 '16 at 00:32

score 2 · Accepted Answer · edited Jun 20 '20 at 09:12

Why is it slower

As you show nicely in the disassembly, the compiler generates vastly different code for the different loops. The main disadvantage of the generated machine code for the custom iterator is, that it does not utilize SIMD instructions.

Compiler optimizations are incredibly complex. You could spend days just looking at the intermediate optimization steps. Also the performance varies between different compilers (and compiler versions). In my tests clang was the only compiler that generated fast SIMD code for your iterator.

You already have one thing perfectly right: Measure. If performance of that part matters for you, you have to measure it.

How to speed it up

The basic idea is to help the compiler with the loop. For instance the end condition. The compiler needs to be able to prove that it doesn't change. You can try to locally keep a const copy the width/height of the image. Unfortunately I haven't been able to achieve significant speedup on your example code.

Another issue is the end iterator. You need to check three conditions in each iteration, even if only one (z) matters. That bring us to alternative loop end conditions such as sentinels.

range-v3

With range-v3 one could potentially implement more efficient and elegant loops. I made a quick attempt at your code:

#pragma once

#include "Image3D.h"

#include <range/v3/all.hpp>

template <typename Image>
class range_3d : public ranges::view_facade<range_3d<Image>>
{
    using data_t = typename Image::data_t;
    friend ranges::range_access;

    int x_ = 0;
    int y_ = 0;
    int z_ = 0;
    // Note: This cannot be a reference, because ranges needs a default constructor
    // If it doesn't get one, the compiler errors will haunt you in your dreams.
    Image* im_;

    data_t const& get() const
    {
        return im_->pixel_at(x_, y_, z_);
    }

    bool done() const
    {
        return z_ == im_->depth();
    }

    void next()
    {
        x_++;
        if (x_ == im_->width())
        {
            x_ = 0;
            y_++;
            if (y_ == im_->height())
            {
                y_ = 0;
                z_++;
            }
        }
    }

public:
    range_3d() = default;
    explicit range_3d(Image& im)
    : im_(&im)
    {
    }
};

// use like:
range_3d<Image3D> r(im);
ranges::for_each(r, [&sum](unsigned char c) {
    sum += c;
});

Unfortunately it still doesn't provide perfect performance, but at least the code is much nicer (as long as you don't get a compiler error).

Context matters

You have to keep in mind, that the performance of the iterators may be much different when you actually use of x/y/z. If your compiler cannot do a sum up the vector optimization, the iterator will probably show more similar performance.

On-demand x/y/z computation

I still think you might want to consider not keeping the separate loop variables. Maybe you could even optimize for images with power-of-two sizes, or block the loop somehow, so that you can compute x/y/z very efficiently from the index using bit operations.

Zero-overhead abstractions

One of the properties i like most about C++ is that it allows you to create nice abstractions with absolutely no runtime overhead. It pains me that I / the compilers fail in this case. I feel this answer lacks a happy-end, so I'd encourage everyone to attempt this as well.

Thanks for that complete answer! I would have hoped for a fast implementation, though :) On your Context matters part, I tried a run where I summed the 6 neighbours (up, down, left, right, front, back), and the performance betweet the 3 `for` and the iterator was 7s vs 10s. So, it's better, but we're not there yet. I'll end up going the for_each way, eventually, but I wanted to give my users the possibility. — V Tolmer, May 20 '16 at 23:12