C++11: Efficiently iterating over matrix with single range-loop?

Question

(For a concrete compiler/platform context take GCC 4.7 and Ubuntu 12.04 on x86_64)

Given some function f:

void f(int x, int y);

int nx = ...;
int ny = ...;

One way to iterate over every value (x,y) from (0,0) to (nx,ny) is:

for (int x = 0; x < nx; x++)
    for (int y = 0; y < ny; y++)
        f(x,y);

Let this compile to some generated code Q1.

We will write a function g such that:

for (auto it : g(Z))
    f(it.x, it.y);

compiles to code Q2.

Is it possible to write g such that Q2 is as efficient as Q1? If yes, how? If not, what is the closest we can get?

You may change auto to auto& or auto&& if it helps.

You may also change it.x to it.x(), and it.y to it.y(), if it helps.

(Recall that the expansion of range-based for is just an iterator-like type of your choosing: C++11: The range-based for statement: "range-init" lifetime?)

Removed `[c]` tag, there is no range-for in C, same goes for `auto` as a deducing declaration type. Also, this isn't really specific to Linux, so I removed that tag too. — Xeo, May 31 '12 at 02:42
While you might be able to get to something *close* in performance to the original loop, chances are that it is going to be at least marginally slower (more complex code implies harder work for the compiler to optimize, and there are going to be extra objects and such. What is the problem with the simple double loop? (Also note the simplest trivial approach: `for (int i = 0; i < nx*ny; ++i) f( i%nx, i/nx );` -- again marginally slower than the original — David Rodríguez - dribeas, May 31 '12 at 03:02
@DavidRodríguez-dribeas: Actually mod and div are typically implemented with an IDIV instruction, which is about 100 times slower than an increment or a compare. — Andrew Tomazos, May 31 '12 at 10:27
@DavidRodríguez-dribeas: The correct answer is independent of the cost of f. — Andrew Tomazos, May 31 '12 at 16:27
The question is rather absurd (changing a common simple loop into something uncommon) and you are asking for 'similar performance'. The relative cost of the solution with respect to the overall operation will make the performance similar or unsimilar. If `f` is a couple of cycles, no solution will be *similar* in performance (a single extra cycle might have a huge impact on the overall loop), but if `f` takes seconds, all solutions for the loop will have *similar* performance (1 or 100 cycles will have no impact on the cost of the whole loop) — David Rodríguez - dribeas, May 31 '12 at 16:51
@DavidRodríguez-dribeas: (a) I didn't ask for "similar" performance, reread question (equal or closest); (b) Providing objects with an iterator interface is neither absurd or uncommon. It enables compatibility with all the generic language features and library functions (beyond range-based for) that use them; and (c) It is obvious that a compound entity with a component much larger than the rest is dominated by that component. — Andrew Tomazos, May 31 '12 at 17:35

score 3 · Accepted Answer · edited Jun 02 '12 at 01:26

3

Is it possible to write g such that Q2 is as efficient as Q1? If yes, how? If not, what is the closest we can get?

Sure its possible, you just need to define iterators that increment in the same way as your for loop. From the top of my head:

class matrix_iterator
{
public:
    ...

    matrix_iterator& operator++()
    {
        if( ++y >= ny )
        {
            ++x;
            y = 0;
        }

        return *this;
    }

private:
    int nx, ny;
    int x, y;
};

edited Jun 02 '12 at 01:26

Jonathan Wakely

166,810
27
341
521

answered May 31 '12 at 02:38

K-ballo

80,396
20
159
169

1

Depending on the cost of `f` the extra cost of the iterator and the copy to the variable in the range-based for loop might be noticeable --Not complaining about the solution, but I just don't think that the question makes sense, and both answers provide a complex solution to a simple problem: just write the two ##### loops! :) – David Rodríguez - dribeas May 31 '12 at 03:06
1

@David Rodríguez - dribeas: Really? Have you seen today's compilers? They can and do optimize out this kind of thing so that the generated code is equivalent. – K-ballo May 31 '12 at 03:07
1

I have seen todays compilers... and the effect will not be *large*, but if `f` is a light weight function it can have a (small) impact. Just as a generic question... `std::bind( &f, x )` in different implementations: boost::bind compiled with g++ makes 4 copies of `x`, g++4.7 `std::bind` makes a single copy (what I would expect), VS2010 makes 7, yes 7!!! copies of `x`. I have seen what compilers do, and the fact is that for something as simple as this, it makes no sense to go overboard with a more complex, harder to maintain and potentially more expensive solution... just saying :) – David Rodríguez - dribeas May 31 '12 at 03:14

SoapBox · Answer 2 · 2012-05-31T19:31:41.857

2

This code has the functionality that you want. I haven't verified it, but I suspect it will produce almost identical machine code (in an optimized compile) as the original loops.

struct iter {
    int x, y, ny;
    iter(int x, int y, int ny) : x(x), y(y), ny(ny) {}
    iter &operator++ () {
        if (++y >= ny)
        {
            y = 0;
            ++x;
        }
        return *this;
    }
    bool operator != (iter const &rhs) const {
        return y != rhs.y || x != rhs.x;
    }
};

struct container {
    iter endit;
    container(int nx, int ny) : endit(nx, ny, ny) {}
    iter begin() const { return iter(0,0,endit.ny); }
    iter const &end() const { return endit; }
};

container g(Z const &z) { return container(z.nx, z.ny); }

for ( auto it : g(Z) )
    f(it.x, it.y);

edited May 31 '12 at 19:31

answered May 31 '12 at 02:48

SoapBox

20,457
3
51
87

Where does the `ny` in `container::begin` come from? – Xeo May 31 '12 at 02:52
Making `endit` const makes `container` pointlessly noncopyable. Make it private and non-const and `container` will have the exact same semantics + copyability. – ildjarn May 31 '12 at 18:56
I don't that copyability is important for this particular example, but I changed it to non-const anyways. – SoapBox May 31 '12 at 19:32
I don't see how this could be more efficient. Totally, you're doing x * y * 3 comparisons : 2 in operator!= and 1 in operator++ for each (x, y). On the other hand, the original 2 imbricated loops do only, x * y comparisons. – Benoît Jun 01 '12 at 06:19
1

The const member made it _non-assignable_ not non-copyable. – Jonathan Wakely Jun 02 '12 at 01:33
@Benoit Short circuiting means 1 of the operations in the operator != will happen less often, so it is closer to x * y * 2 operations, and I believe the an optimizing compiler might be able to get rid of some more of them, making it close to the same number of comparisons. But feel free to compare the assembly and/or performance and let us know the difference. – SoapBox Jun 02 '12 at 04:07
You're right. But ack that our "loop" seems twice more expensive. Of course, assembly needs to be compared and see what the compiler does. But i feel like we're getting quite far for template loops unrolling for the sake beauty. – Benoît Jun 02 '12 at 08:46

C++11: Efficiently iterating over matrix with single range-loop?

2 Answers2