The most important rule is that it's all theory until you profile. I don't hold with those who insist that profiling is everything (without some theory you're no better than a Cargo Cultist putting coconuts on their ears and waiting for the plane to come) but your theory can always be wrong or incomplete, so profiling is crucial.
Generally, we want the inner-scan to be horizontally (in terms of the array, rather than the image, though for most formats that's the same). The reason being that with an array like:
00 01 02 03 04 05 06 07 08 09
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
It is going to be laid out as:
00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
You want to be scanning along contiguous blocks that can be loaded into CPU caches and then used entirely, rather than scanning from block to block and needing to regularly change the CPU cache contents.
This is even more important if you try to parallelise the algorithm. You want each thread dealing with its own contiguous blocks of memory as far as both input and output goes, rather than not only suffering the way single-threaded code does with poor cache-hit-frequence but also causing each other's buffers to be dirtied and need refreshing. This can be the difference between parallelising leading to a speed boost and parallelising actually slowing things down.
Another thing is the difference between a 2-dimensional array byte[,]
rather than an array of arrays byte[][]
, which your comment in your question "array[y][x]" makes me wonder if perhaps you are using. With the former to obtain arr[1,2] the logic is:
- Check Bounds
- Calculate position (simple fast arithmetic)
- Retrieve value.
With the latter, the logic is:
- Check bounds
- Obtain array through pointer.
- Check bounds
- Retrieve value.
There is also less good memory cache-hit-frequence. The latter has benefits when "jagged" structures are needed, but that is not the case here. 2D is almost always faster than array of arrays.
Things I don't see as likely to help, but I would certainly try them in your situation:
You may find a boost from doing your 1d <=> 2d logic. Have a single-dimension array where idx = y * width + x. It shouldn't make an appreciable difference, but it's worth trying.
Optimisations do try to both hoist calls to .Length
and omit needless bounds checking, so you may find manually hoisting and switching to pointer arithmetic doesn't gain anything, but in a case where you really do need to bring time down it is certainly worth profiling.
Finally. Have you profiled how fast your code is at scanning the array and doing nothing? It could be that another part of the code is the real bottleneck, and you're fixing the wrong thing.