
A friend asked me to share something I stumbled upon a while ago. The original post is taken from here. The problem statement can be found here, on a site for algorithmic competitions.

I was put in front of an algorithmic problem that I solved using the following code:

#include <cstring>
#include <vector>
using namespace std;

double dp[80002][50];

class FoxListeningToMusic {
public:
    vector<double> getProbabilities(vector<int> length, int T) {
        memset(dp, 0, sizeof(dp));
        int n = length.size();

        // At time 0 each of the n songs carries probability 1/n.
        for(int i = 0; i < n; i++)
            dp[0][i] = 1.0 / (double)n;

        double mul = 1.0 / (double)n;
        int idx;
        for(int i = 1; i <= T; i++) {
            for(int j = 0; j < n; j++) {
                idx = i - length[j];
                if(idx >= 0) {
                    // Mix in the distribution from time idx, scaled by 1/n.
                    for(int k = 0; k < n; k++)
                        dp[i][k] += mul * dp[idx][k];
                }
                else
                    dp[i][j] += mul;
            }
        }

        vector<double> v(n);
        for(int i = 0; i < n; i++)
            v[i] = dp[T][i];
        return v;
    }
};

It is not important whether the code solves the problem correctly, at least for what I am going to discuss. The fact is that I got a time-limit-exceeded verdict with this code (meaning it ran for more than 2 seconds on some test cases). That was somewhat expected, as the complexity here is O(T * length.size()^2), which becomes 2 * 10^8 if we take the problem constraints into account. However, the interesting thing is that I had tested my solution specifically against the time limit. The case I used seemed to be the "worst case" for my solution: 50 1s in length and T = 80000. The code ran for 0.75 seconds, well below the time limit of 2 seconds.
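For reference, I timed it locally with a harness along these lines (the main function below is illustrative and mine, not the judge's environment; it assumes the class above is in the same translation unit):

#include <chrono>
#include <cstdio>

int main() {
    // The presumed "worst case": 50 songs of length 1 and T = 80000.
    vector<int> length(50, 1);
    int T = 80000;

    FoxListeningToMusic solver;
    auto start = chrono::steady_clock::now();
    vector<double> result = solver.getProbabilities(length, T);
    auto end = chrono::steady_clock::now();

    printf("ran for %f seconds, p[0] = %.9f\n",
           chrono::duration<double>(end - start).count(), result[0]);
    return 0;
}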

I say the case I used is the worst case because the number of instructions executed depends only on the branching condition idx >= 0 in the inner for. When it holds, one more loop of complexity O(n) is executed; otherwise only a single O(1) operation is. And, as you can see, the smaller the elements of length, the more often this condition holds.

Despite this reasoning, my solution fails on the following case:

length = {1, 1, 1, 1, 3, 3, 3, 3, 1, 3, 3, 2, 3, 2, 3, 3, 1, 2, 3, 1, 2, 3, 2,
          1, 3, 1, 1, 1, 2, 3, 2, 3, 2, 2, 1, 3, 1, 1, 3, 1, 3, 1, 3, 2, 3, 1,
          1, 3, 2, 76393}, T = 77297.
For this case my program runs for 5.204000 seconds.

My first assumption was that the reason for this unexpected ratio of running times (given that in the first case far fewer processor instructions should be executed) was that the processor somehow caches similar calculations: in my example the calculations are symmetric with respect to all the elements of length, and a really clever processor could use this to avoid repeating the same sequence of instructions. So I tried composing another example, this time with different values in the length array:

length = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
          21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38,
          39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 77943}
T = 80000 runs for 0.813000 seconds.

After this example I can no longer explain these timings: my second example seems to require more processor instructions than the test that failed me, and it does not allow the caching I thought was happening in the first example. I have not been able to pin down the cause of this behaviour, but I am quite sure it has something to do either with processor caches or with pipelines. I am very curious how these experiments behave on different chipsets, so feel free to comment here.

Also, if anyone more knowledgeable about hardware than I am can explain this behaviour, it would be appreciated.

Until then, there is a note I should make to myself: when estimating algorithmic complexity, do not underestimate processor optimizations. Sometimes they significantly increase or decrease the effective speed on specific inputs.

Boris Strandjev
  • possible duplicate of [Why does changing 0.1f to 0 slow down performance by 10x?](http://stackoverflow.com/questions/9314534/why-does-changing-0-1f-to-0-slow-down-performance-by-10x) – Eric Postpischil Oct 12 '13 at 14:00

2 Answers


The reason for this weird behaviour turned out to be denormal numbers. Adding code to treat such numbers as exact zeroes sped up my code immensely on these corner cases.

HINT: Denormal numbers in this case are numbers that are extremely close to zero (e.g. around 10^-38 for floats; correction due to @PascalCuoq). For such numbers the processor becomes much slower, because of this (taken from Wikipedia):

Some systems handle denormal values in hardware, in the same way as normal values. Others leave the handling of denormal values to system software, only handling normal values and zero in hardware. Handling denormal values in software always leads to a significant decrease in performance.

EDIT: I also found this suggestion on SO on how you can check whether a number has become denormal.
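For completeness, here is roughly the kind of clamp I used (the helper function and the exact threshold below are illustrative; in the contest code it was a single ternary on dp[i][k] right after the innermost loop, as mentioned in the comments):

#include <cmath>   // std::fpclassify, FP_SUBNORMAL

// Threshold below which a probability is treated as an exact zero.
// The precise value is a judgement call (see the comments below).
const double VERY_SMALL_EPSILON = 1e-20;

// Flush values that have drifted towards the denormal range back to zero,
// so that subsequent arithmetic stays on the fast hardware path.
void flushTinyValues(double* row, int n) {
    for(int k = 0; k < n; k++) {
        if(row[k] < VERY_SMALL_EPSILON)
            row[k] = 0.0;
        // Alternatively, detect true denormals explicitly:
        // if(std::fpclassify(row[k]) == FP_SUBNORMAL) row[k] = 0.0;
    }
}

Calling something like flushTinyValues(dp[i], n) once per time step i is enough to keep the whole table out of the denormal range.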

Boris Strandjev
  • What operating system and compiler are you using? – Violet Giraffe Oct 12 '13 at 10:22
  • So what did you do? Enable the Denormals Are Zero flag? Or Flush To Zero? Both? (also, what microarchitecture?) – harold Oct 12 '13 at 10:23
  • @harold I placed one check of `dp[i][k] = (dp[i][k] < VERY_SMALL_EPSYLON) ? 0 : dp[i][k];` after the innermost loop. This was part of a competition; I did not have the right to change the flags. – Boris Strandjev Oct 12 '13 at 10:29
  • @VioletGiraffe Both G++ and Visual Studio. – Boris Strandjev Oct 12 '13 at 10:29
  • “e.g. 10^-20”?! The largest denormal in single-precision is around 10^-38. You are off by a factor of nearly 10^18 for single-precision denormals (and 10^288 for double-precision). – Pascal Cuoq Oct 12 '13 at 14:07
  • @PascalCuoq: Ok, thanks for the correction; I will use the constant you provide. Actually, I have never measured on any system where the denormal range begins. For me and my particular solution it was sufficient to treat everything below 10^-20 as zero, but I guess it is important to have the correct limits. – Boris Strandjev Oct 12 '13 at 14:59
  • @BorisStrandjev I didn't mention it because harold already had, but the usual solution is to let the FPU map denormal values to zero. Your solution has its advantages (portability, or for instance if you integrate several numerical libraries that have their own requirements on the FPU state to work properly), but configuring the FPU to flush to zero means not having to worry about what the limit is, among other things. – Pascal Cuoq Oct 12 '13 at 15:03 (see the sketch after these comments)
  • If you cannot do it by FPU configuration, it may be better to use a larger constant, program requirements permitting. Doing so may reduce the number of calculations with very small normal number inputs that produce a denormalized result. – Patricia Shanahan Oct 12 '13 at 15:31
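For readers who are allowed to change the FPU state (I was not, in this competition), the flush-to-zero configuration that harold and Pascal Cuoq mention above looks roughly like this on x86 with SSE; this is a sketch, not something I used in my submission:

#include <xmmintrin.h>  // _MM_SET_FLUSH_ZERO_MODE (SSE)
#include <pmmintrin.h>  // _MM_SET_DENORMALS_ZERO_MODE (SSE3)

int main() {
    // FTZ: results that would be denormal are flushed to zero.
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    // DAZ: denormal inputs are treated as zero.
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);

    // ... run the DP as usual; no per-element clamping is needed.
    return 0;
}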

Another option for dealing with this kind of situation is to use fixed-point arithmetic and avoid floating point altogether. The problem statement requires that the answer be accurate up to 1e-9, and since 2^64 is about 10^19 and you are only doing at most 80000 iterations, this is plenty of precision. The way this would work is to define a large constant, say

const uint64_t ONE = pow(10,17);

You initialize your array of uint64_t to ONE/n rather than 1.0/double(n), and the main loop would look like this:

for(int i = 1; i <= T; i++) {
    for(int j = 0; j < n; j++) {
        idx = i - length[j];
        if(idx >= 0) {
            // Accumulate in integers only; the division by n is deferred.
            for(int k = 0; k < n; k++)
                dpi[i][k] += dpi[idx][k];
        }
        else
            dpi[i][j] += ONE;
    }
    // One division per element per time step, instead of a multiplication
    // by 1/n inside the innermost loop.
    for(int k = 0; k < n; k++)
        dpi[i][k] = dpi[i][k] / n;
}

In theory, this should be faster, since you are avoiding floating point operations in the main loop, and the inner loop consists only of integer additions. On my machine the performance improvement is only about 10%, which indicates that the real bottleneck might be memory access. But, in other situations, you might see a bigger performance improvement.
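The declaration of the integer table and the final conversion back to probabilities are left implicit above; assuming a global uint64_t dpi[80002][50] mirroring the question's dp, the tail of the method would look something like this:

// Convert the fixed-point values back to probabilities in [0, 1].
vector<double> v(n);
for(int k = 0; k < n; k++)
    v[k] = (double)dpi[T][k] / (double)ONE;
return v;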

mrip
  • Hmm, I am worried about the `/n` calculations; they will be imprecise in your approach. Have you actually tried passing the problem with this approach? – Boris Strandjev Oct 12 '13 at 17:11
  • Yes, I tried a few inputs, including the ones you provided. The difference between this and the `double` solution is on the order of `1e-15`. – mrip Oct 12 '13 at 17:15
  • Sorry for the very late response, but I have now tested your approach and can confirm that it works; the problem is also solved using it. Thank you for sharing yet another approach to this particular problem. – Boris Strandjev Dec 15 '13 at 10:43