
I am writing code for a mathematical method (Incomplete Cholesky) and I have hit a curious roadblock. Please see the following simplified code.

for(k=0;k<nosUnknowns;k++)
{
    //Pieces of code
    for(i=k+1;i<nosUnknowns;i++)
    {
        // more code
    }
    for(j=k+1;j<nosUnknowns;j++)
    {
        for(i=j;i<nosUnknowns;i++)
        {
            //Some more code 
            if(xOk && yOk && zOk)
            {
                if(xDF == 1 && yDF == 0 && zDF == 0)
                {
                    for(row=0;row<3;row++)
                    {
                        for(col=0;col<3;col++)
                        {
                            // All 3x3 static arrays This is the line
                            statObj->A1_[row][col] -= localFuncArr[row][col];
                        }
                    }
                }
            }
        }//Inner loop i ends here
    }//Inner loop j ends here
}//outer loop k ends here

For context,

statObj is an object containing a number of 3x3 static double arrays. I initialize statObj with a call to new, and then populate the arrays inside it using some mathematical functions. One such array is A1_. The value of nosUnknowns is around 3000. The array localFuncArr is a 3x3 double array previously generated by matrix multiplication.

Now this is my problem:

  1. When I use the line as shown in the code, the code runs extremely sluggishly: about 245 seconds for the whole function.

  2. When I comment out the said line, the code runs extremely fast: about 6 seconds.

  3. When I instead replace the said line with localFuncArr[row][col] += 3.0, the code again runs at the same speed as in case (2) above.

Clearly something about accessing statObj->A1_ is making the code run slow.

My question(s):

  1. Is cache poisoning the reason why this is happening?

  2. If so, what could be changed in terms of array initialization, object initialization, loop unrolling, or any other form of code optimization to speed this up?

Any insights from experienced folks are highly appreciated.

EDIT: Changed the description to be more verbose and to address some of the points mentioned in the comments.

Ulrich Eckhardt
rajaditya_m
  • If you comment out the line in the innermost loop chances are those loops are all optimized away because they don't do anything. Hence the speed difference. – Borgleader Jul 17 '13 at 23:41
  • What line is "this line"? If you mean `statObj->A1_[row][col] -= l ...`, then without that, the code (or at least its inner loops, depending on what stuff like "Some more code" is) does *nothing at all*. – David Schwartz Jul 17 '13 at 23:41
  • If I were you, I'd add more loops, and preferably nested ones, so that things really become hard to read. – Tony The Lion Jul 17 '13 at 23:42
  • If the question is, "Is it cache poisoning or is it something else?", then the answer is most definitely "yes". – Kerrek SB Jul 17 '13 at 23:43
  • If `statObj` and `localFuncArr` can't possibly overlap, make sure the compiler knows this. Otherwise, it has much less flexibility in optimizing the nested loops. – David Schwartz Jul 17 '13 at 23:45
  • @Borgleader : I have made some edits. I have actually not commented out the whole loop. – rajaditya_m Jul 17 '13 at 23:46
  • @rajaditya_m: The loops have no observable effect: They do nothing. – Kerrek SB Jul 17 '13 at 23:47
  • @KerrekSB : Thank you. I was hoping someone would say NO and point out to a silly error which I would correct. Now it becomes hard to debug even more :( – rajaditya_m Jul 17 '13 at 23:47
  • @rajaditya_m If you comment out that line, the two innermost loops don't have any code in them. So they're most likely optimized out by the compiler. – Borgleader Jul 17 '13 at 23:47
  • @Borgleader : If I replace the line by something like localFuncArr[row][col] += 3.0, it also runs with the same speed. – rajaditya_m Jul 17 '13 at 23:48
  • @rajaditya_m: Sorry, I was just being snarky, as in "Yes, it is cache poisoning or something else", in reference to your not very useful question title. The real answer is that you haven't presented a very good test case (ideally with a baseline to compare against), and/or generated machine code which one could inspect for crucial differences. Memory locality may well be an issue here, but it may also be something you just have to live with. – Kerrek SB Jul 17 '13 at 23:51
  • @TonyTheLion : I tried to make it as readable as possible. However the structure of the program dictates the usage of so many loops. – rajaditya_m Jul 17 '13 at 23:51
  • @Mystical: Are you sure that's what the OP meant? – Kerrek SB Jul 17 '13 at 23:53
  • 60x slower when using arithmetic... That sounds to me like a case of denormalized floating-point numbers. What values are you storing in those arrays? – paddy Jul 17 '13 at 23:53
  • @KerrekSB : I made the title a little more technical. Also I would say that I am using Visual Studio instead of gcc. A lot of people suggested to me that tweaking around the compiler flags may produce more optimized code. – rajaditya_m Jul 17 '13 at 23:54
  • It doesn't matter which compiler you use, you should still be able to make it optimize your code and show you the generated assembly. – Kerrek SB Jul 17 '13 at 23:55
  • @KerrekSB It is about performance isn't it? All I did was add the tag. – Mysticial Jul 17 '13 at 23:56
  • When your innermost loop says `anything[row][col]` the index calculations are going to be costly. I would just increment pointers. In fact, I would unroll that whole inner pair of loops and use constant indexes. Somebody is bound to say "but the compiler should do that". Well, maybe, and maybe not. – Mike Dunlavey Jul 17 '13 at 23:57
  • @Mysticial: I think you edited some assertions about which piece of code change would behave in which way, and I wasn't sure if that's what the OP meant. It was just unclear to me, but you may have understood it better. – Kerrek SB Jul 17 '13 at 23:58
  • @KerrekSB All I did was add the tag. I didn't edit the body of the question. – Mysticial Jul 17 '13 at 23:59
  • @Mysticial: Oh OK, never mind then. Maybe several different edits got merged into one in the history. – Kerrek SB Jul 17 '13 at 23:59
  • @rajaditya_m paddy has a good point here. Are you initializing the data? If they are not initialized, it's very likely that some of it will contain [denormalized data - which can incur massive penalties](http://stackoverflow.com/questions/9314534/why-does-changing-0-1f-to-0-slow-down-performance-by-10x). Try zero-initializing the data, then running it with arithmetic. – Mysticial Jul 18 '13 at 00:17
  • @GManNickG, I have changed the description. If you think this is clear please remove the hold. Thanks – rajaditya_m Jul 18 '13 at 03:19
  • @Borgleader : I have changed the description. If you think this is clear please remove the hold. Thanks – rajaditya_m Jul 18 '13 at 03:20
  • @Kerrek SB: I have changed the description. If you think this is clear please remove the hold. Thanks – rajaditya_m Jul 18 '13 at 03:20
  • @Tony The Lion: I have changed the description. If you think this is clear please remove the hold. Thanks – rajaditya_m Jul 18 '13 at 03:20
  • @Rapptz: I have changed the description. If you think this is clear please remove the hold. Thanks – rajaditya_m Jul 18 '13 at 03:21
  • @rajaditya_m, I predict you will have trouble reopening this question. So in the meantime, some helpful advice: profile this code using modern CPU performance counters (look for tools like `perf`, `oprofile` or `VTune`). They can tell you how often you're seeing last level cache misses, Floating point assists (denormal operands), etc. – Brian Cain Jul 18 '13 at 03:33
  • I voted to reopen this question. IMO it's closer to "clear what's being asked" than many questions which deserve this closure. @rajaditya_m, please generate a [SSCCE](http://sscce.org/) in order to get more assistance. – Brian Cain Jul 18 '13 at 03:39
  • One part of your question is whether unnecessarily much memory bandwidth is used. Concerning this, there are two things I would try. The first is to simply swap the rows/columns in the inner loop, though I doubt that this changes much, considering the size. Then, I would try switching to plain float instead of double, which halves the memory bandwidth requirements. The best advice here is to use real profiling tools though; without them, everything here is just guessing. – Ulrich Eckhardt Jul 18 '13 at 05:29
  • How are those two-dimensional arrays defined? Are they genuine two-dimensional arrays, or are they arrays of arrays? – Kerrek SB Jul 18 '13 at 07:53
  • Can you please post a minimal, self-contained example? I tried reproducing this, with no luck. I seem to get some cleverly vectorised code that runs efficiently both with the constant RHS and with the array on the RHS. – Kerrek SB Jul 18 '13 at 08:17
  • Another guess: pointer aliasing? Try `__restrict`. – chirlu Jul 18 '13 at 09:14
  • In the offending loop, you have the comment `// All 3x3 static arrays This is the line`. Does this mean you are also operating on other arrays of floating point numbers in the same loop, and removed those other arrays to make it easier for us to read, but only this array is giving you trouble? – Sam Skuce Jul 18 '13 at 14:52
  • BTW, I cannot find a proper definition of cache poisoning on Google. Can anyone share a link? Just curious. – Display Name Jul 18 '13 at 15:08
  • @chirlu : Can you give me a specific link about this `__restrict`? – rajaditya_m Jul 18 '13 at 16:05
  • http://en.wikipedia.org/wiki/Restrict - most C++ compilers support this as well, including MSVC (but prefixed with _two_ underscores). – chirlu Jul 18 '13 at 16:27

1 Answer


If the conditions are mostly true, your line of code is executed 3000x3000x3000x3x3 times, which is about 245 billion times. Depending on your hardware architecture, 245 seconds might be a very reasonable timing (that's one iteration every 2 cycles, assuming a 2 GHz processor). In any case, there isn't anything in the code that suggests cache poisoning.

Come Raczy
  • You are absolutely right. However, this code is for a sparse matrix, so the inner loop runs no more than about 20 times; hence such high runtimes are a mismatch. – rajaditya_m Jul 18 '13 at 16:05