
I am trying to figure out why "Choice A" performs better than "Choice B". My test shows something like 228 vs 830 or thereabouts; it's roughly a 4x difference. Looking at the IL, the untrained eye doesn't pick out the subtlety between the two calls.

Thank you, Stephen

const int SIZE = 10000;
void Main()
{
    var sw = Stopwatch.StartNew();

    int[,] A = new int[SIZE, SIZE];
    int total, x, y;

    // Choice A
    total = 0;
    for (x = 0; x < SIZE; x++)
    {
        for (y = 0; y < SIZE; y++)
        {
            total += A[x, y];
        }
    }

    Console.WriteLine(sw.ElapsedMilliseconds);

    sw.Reset();
    sw.Start();

    // Choice B
    total = 0;
    for (y = 0; y < SIZE; y++)
    {
        for (x = 0; x < SIZE; x++)
        {
            total += A[x, y];
        }
    }

    Console.WriteLine(sw.ElapsedMilliseconds);
}

// Define other methods and classes here

Ok, I broke this out so that the two loops would run independently of each other and mitigate any caching and/or diagnostics effects... and B is ALWAYS coming in behind A.

using System;
using System.Diagnostics;

namespace ConsoleApplication1
{
    class ProgramA
    {
        const int SIZE = 10000;
        static void Main(string[] args)
        {
            var sw = Stopwatch.StartNew();

            int[,] A = new int[SIZE, SIZE];
            int total, x, y;

            // Choice A
            total = 0;
            for (x = 0; x < SIZE; x++)
            {
                for (y = 0; y < SIZE; y++)
                {
                    total += A[x, y];
                }
            }

            Console.WriteLine(sw.ElapsedMilliseconds);
            Console.ReadLine();
        }
    }

    class ProgramB
    {
        const int SIZE = 10000;
        static void Main(string[] args)
        {
            var sw = Stopwatch.StartNew();

            int[,] A = new int[SIZE, SIZE];
            int total, x, y;

            // Choice B
            total = 0;
            for (y = 0; y < SIZE; y++)
            {
                for (x = 0; x < SIZE; x++)
                {
                    total += A[x, y];
                }
            }

            Console.WriteLine(sw.ElapsedMilliseconds);
            Console.ReadLine();
        }
    }
}
Stephen Patten
  • I think the instrumentation code is a little bit suspect, but I *would* expect A to perform better since it's more cache-friendly. http://stackoverflow.com/questions/763262/cache-efficient-code/763280#763280 – Ani Dec 09 '10 at 03:55
  • Ran the same code sample on my machine. Choice B is only slightly slower than Choice A. Choice A = 1563 ms Choice B = 2121 ms – Joseph Sturtevant Dec 09 '10 at 03:59
  • Oh man I've seen this question asked a few years ago and I can't remember the answer. I think it's because the second one creates a new dimension on each iteration while the first one has the first dimension already. – Phill Dec 09 '10 at 04:01
  • @Joseph: The magnitude of the difference between these two options is likely to be highly processor-dependent. – Anon. Dec 09 '10 at 04:01
  • @Phill: Is [this](http://stackoverflow.com/questions/997212/fastest-way-to-loop-through-a-2d-array) perhaps the similar question you were looking for? And [this](http://lbrandy.com/blog/2009/07/computational-performance-a-beginners-case-study/) blog entry as a reflection from that post. Although it appears to me that that example indicates choice *B* should be faster... – Cody Gray - on strike Dec 09 '10 at 04:26
  • Question wasn't posted on Stack, I was reading a blog post where someone had emailed the guy directly and it was explained; it wasn't that blog post though. I wish I could find it :( – Phill Dec 09 '10 at 04:41
  • Joseph, make sure you are not testing debug build in VS. Also you can have CPU with either tiny or huge cache that would level the difference. – Alexei Levenkov Dec 09 '10 at 04:52
  • @Cody Gray, I just remembered, the blog I was reading was using a bit-map image in order to explain the horizontal vs vertical processing. – Phill Dec 09 '10 at 04:55
  • Now, that's interesting... I'd need to see the IL, but maybe the subtle difference is because of how the data is segmented, and the time it takes to locate the right cell? Note: this should actually be a comment, but I don't have that "privilege" yet ¬¬' – void Jan 11 '11 at 08:27

3 Answers


At a guess, cache effects would be the big one here.

A two-dimensional array is laid out in memory like so:

(0, 0) (0, 1) (0, 2) (0, 3) (1, 0) (1, 1) (1, 2) ...

In option A, you're accessing successive elements in memory - this means that when the CPU fetches a cache line, it gets several successive elements. Option B, by contrast, jumps around through memory, so it requires significantly more memory accesses once the array becomes larger than the cache size.
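For intuition, here is a minimal sketch (mine, not code from the question) that writes the same two traversals against a one-dimensional array so the row-major offset arithmetic is explicit. With SIZE = 10000, Choice A's inner loop advances 4 bytes between reads, while Choice B's advances SIZE * 4 = 40,000 bytes, skipping ahead a full row on every read.

using System;

class StrideSketch
{
    const int SIZE = 10000;

    static void Main()
    {
        // One flat buffer standing in for the int[SIZE, SIZE] array.
        int[] flat = new int[SIZE * SIZE];
        long total = 0;

        // Choice A equivalent: offset = x * SIZE + y with y varying fastest,
        // so consecutive reads are adjacent in memory (stride of 1 element).
        for (int x = 0; x < SIZE; x++)
            for (int y = 0; y < SIZE; y++)
                total += flat[x * SIZE + y];

        // Choice B equivalent: x varies fastest, so consecutive reads are
        // SIZE elements (40,000 bytes) apart and each one lands on a new cache line.
        for (int y = 0; y < SIZE; y++)
            for (int x = 0; x < SIZE; x++)
                total += flat[x * SIZE + y];

        Console.WriteLine(total);
    }
}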

Anon.

Ahh I think I remember.

If you think of a 2d array as a table in memory, the first value is the row and the second value is the column.

[0, 0] [0, 1] [0, 2] [0, 3]... [1, 0] [1, 1] [1, 2] [1, 3]...

When you iterate over it, the first loop is the row and the second loop is the column. It's quicker to iterate row by row, reading each column within that row.

In the second scenario the values are accessed as

[0, 0] [1, 0] [2, 0] [3, 0]... [0, 1] [1, 1] [2, 1] [3, 1]...

So this is slower because of how you're looping: for each column you visit every row, touching only a single element of each row before jumping on to the next row.

Does that make sense?

Edit: This was one of the things I was looking for:

http://en.wikipedia.org/wiki/Row-major_order

In row-major storage, a multidimensional array in linear memory is accessed such that rows are stored one after the other.

So when iterating a row at a time, it's not jumping around memory looking for the next row on every access; it has the row, reads all of its columns, then jumps to the next row in memory.
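To check the row-major claim directly, here's a small sketch (my own, not part of this answer) that copies a 2x3 rectangular array into a flat buffer with Buffer.BlockCopy and prints the element order; the elements come out row by row.

using System;

class RowMajorCheck
{
    static void Main()
    {
        int[,] a =
        {
            { 10, 11, 12 },   // row 0
            { 20, 21, 22 }    // row 1
        };

        // Copy the raw bytes of the rectangular array into a 1D buffer.
        int[] flat = new int[6];
        Buffer.BlockCopy(a, 0, flat, 0, 6 * sizeof(int));

        // Prints: 10 11 12 20 21 22  -- row 0 first, then row 1.
        Console.WriteLine(string.Join(" ", flat));
    }
}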

Phill

To expand upon the caching answers:

The values in question are 4 bytes each, and IIRC current memory architecture reads 16-byte lines from memory, assuming a properly populated motherboard. (I don't know about DDR3; its three-chip nature suggests the reads are even bigger.) Thus when you read a line of memory you get 4 values.

When you do it the first way, you use all of these values before going back to memory for the next line. Done the second way, you use only one of them, and it gets flushed from the on-chip cache long before it's called for again.
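As a back-of-the-envelope sketch (my numbers, not the answer's), you can estimate how many distinct cache-line fetches each traversal needs. This assumes 4-byte ints and 64-byte cache lines, a common width on current CPUs; the exact line size only changes the ratio, not the conclusion.

using System;

class CacheLineEstimate
{
    static void Main()
    {
        const long SIZE = 10000;
        const long elements = SIZE * SIZE;          // 100,000,000 ints
        const long intsPerLine = 64 / sizeof(int);  // 16 ints per 64-byte line

        // Choice A reuses every fetched line for 16 consecutive reads.
        long rowMajorFetches = elements / intsPerLine;

        // Choice B touches a different line on nearly every read, and by the
        // time it comes back to a line it has long since been evicted.
        long colMajorFetches = elements;

        Console.WriteLine("Choice A: ~" + rowMajorFetches.ToString("N0") + " line fetches");
        Console.WriteLine("Choice B: ~" + colMajorFetches.ToString("N0") + " line fetches");
    }
}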

Loren Pechtel