efficient loop collapse

Question

in certain applications, I have need to collapse nested loops into one while retaining individual index information.

for j in N:
  for i in M:
    ... A(i,j) ...

// Collapse the loops
for ij in MN:
  ... A(i,j) ...

so have looked at the obvious ways to recover i,j from ij using division/modulo (expensive operation) and using if statements (breaks vectorization, branch prediction problems).in the end i came up with the following (using C-style comparisons):

j += (i == m)
i *= (i != m)
++i, ++ij

is there perhaps a even better way to do that? thanks

This is not a language agnostic question. I can think of ways to do this in Haskell or Python that don't work in other languages. Could you explain why you need to collapse nested loops? — Dietrich Epp, Jan 22 '10 at 06:14
well, in principle any turing language can accomplish python product/range combination, however it may not be efficient. The main reasons to collapse loops: matrix iterators, low level vectorization (think SSE/cuda) — Anycorn, Jan 25 '10 at 01:17

Will · Answer 1 · 2010-01-22T06:58:29.647

8

In general, it offers no performance advantage to collapse the loop as described.

Compilers do sometimes collapse such loops, but typically in unexpected ways.

In particular languages, or on particular platforms, you can speed up loops in general by:

counting downwards
making the function called in the body 'inline', or having the code in the loop body rather than a separate function
configuring the compiler - typically via command-line options - to 'unroll' loops and to remove frame pointers and such

But in all cases you have to have profiled your code to see that such efforts are warranted.

Generally, in my experience, nested loops like this are dominated by:

containers; avoid boxing and bounds checking if possible and you know you're safe
the cost of invoking other methods in them; use 'inline' if thats available
pipeline stalls by bad locality of reference; rearrange your memory if possible
pipeline stalls by second conditions; fewer ifs and indirect references is better

But that might not be applicable advice on your problem domain and platform. Profile!

edited Jan 22 '10 at 06:58

answered Jan 22 '10 at 06:51

Will

73,905
40
169
246

++ Right, right, right. I would only say in place of Profile, use stackshots (http://stackoverflow.com/questions/406760/whats-your-most-controversial-programming-opinion/1562802#1562802) because almost always this kind of optimizing is "barking up the wrong tree". – Mike Dunlavey Jan 22 '10 at 17:11
Agree, programing is people, compiling is for computers – Luka Rahne Jan 23 '10 at 02:39
1

there is several reasons to do that: matrix/array iterators, optimization (only inner loops vectorized), cuda programming, OpenMP programming (3.0 has that). Compilers are very good but sometimes they need help from a human. – Anycorn Jan 25 '10 at 01:13
I think the OP should illustrate these cases then. Because I've never seen my cuda go short on registers for separate iterators for each dimension. – Will Jan 25 '10 at 07:13

score 0 · Answer 2 · answered Jan 22 '10 at 06:45

0

Going other way might be cheaper.

for j in N:
  for i in M:
    ij=j*i+j

answered Jan 22 '10 at 06:45

Luka Rahne

10,336
3
34
56

1

or, `ij=0; for j in N: for i in M: ...; ij += 1` – Jason Orendorff Jan 22 '10 at 13:40

score 0 · Answer 3 · answered Dec 30 '20 at 15:48

0

I am not sure why do you want to collapse loops. Make sure the most inner loop has a high trip count (by loop inversion) and make sure your data is sequential in memory. I've seen algorithms run 10 times faster when these two conditions are met.

answered Dec 30 '20 at 15:48

Bogi

2,274
5
26
34

efficient loop collapse

3 Answers3