
I have the following C++ code snippet (the C++ part is the profiler class, which is omitted here), compiled with VS2010 (64-bit Intel machine). The code simply multiplies an array of floats (arr2) by a scalar and puts the result into another array (arr1):

int M = 150, N = 150;
int niter = 20000; // do many iterations to have a significant run-time
float *arr1 = (float *)calloc (M*N, sizeof(float));
float *arr2 = (float *)calloc (M*N, sizeof(float));

// Read data from file into arr2

float scale = float(6.6e-14);

// START_PROFILING
for (int iter = 0; iter < niter; ++iter) {
    for (int n = 0; n < M*N; ++n) {         
        arr1[n] += scale * arr2[n];
    }
}
// END_PROFILING

free(arr1);
free(arr2); 

The reading-from-file part and the profiling (i.e. run-time measurement) are omitted here for simplicity.

When arr2 is initialized to random numbers in the range [0, 1], the code runs about 10 times faster than when arr2 is initialized to a sparse array in which about 2/3 of the values are zeros. I have played with the compiler options /fp and /O, which changed the run-time a little, but the roughly 1:10 ratio was kept.

  • How come the performance depends on the actual values? What does the CPU do differently that makes the sparse data run ~10 times slower?
  • Is there a way to make the "slow data" run faster, or will any optimization (e.g. vectorizing the calculation) have the same effect on both arrays (i.e., the "slow data" will still run slower than the "fast data")?

EDIT

Complete code is here: https://gist.github.com/1676742; the command line for compiling is in a comment in test.cpp.

The data files are here:

Itamar Katz
  • Please could you provide complete, compilable versions of the two tests, so that we can experiment with them? – NPE Jan 25 '12 at 15:03
  • Could it be that when you pass `0` to your float in your sparse matrix, the conversion from `int` to `float` introduces some overhead? – Tony The Lion Jan 25 '12 at 15:10
  • You only update the elements which aren't 0? So, can it be that the zeros are not in the cache? – duedl0r Jan 25 '12 at 15:11
  • @duedl0r: That thought occurred to me too. However, this argument would only apply to the first iteration of 20,000 since the addition loop traverses the entire array. – NPE Jan 25 '12 at 15:12
  • @aix: hmm yes, you might be right. But you never know what the compiler generates :) In this case he can switch the for loops, so the `niter` loop is the inside loop, no? Maybe he even generates `arr1[n] += niter * scale * arr2[n]` if he's really crazy ;) – duedl0r Jan 25 '12 at 15:20
  • I do not pass `0`s since I read binary data from files. See my edited answer for links to code and data. – Itamar Katz Jan 25 '12 at 15:20
  • @aix: I actually didn't mean to accumulate the sum over all iterations, I just forgot a `memset` line - but anyway it doesn't change the run-time results. – Itamar Katz Jan 25 '12 at 15:35
  • Question for the FPU experts: Could it be that the FPU considers 0.0 a subnormal number in this scenario? FP ops including subnormals are slower than ordinary ops on some FPUs, so that could explain it if 0 is considered subnormal. – Daniel Fischer Jan 25 '12 at 16:00
  • Is this homework? I ask because the numbers in the data files seem to misdirect how you're interpreting the results. – Mark Synowiec Jan 25 '12 at 16:05
  • @DanielFischer, Exact zeros (0.0) are not considered as denormalized and are processed quickly. – Evgeny Kluev Jan 25 '12 at 16:10
  • @EvgenyKluev I hoped and expected that, but I didn't know. Thanks. – Daniel Fischer Jan 25 '12 at 16:47
  • @MarkSynowiec: I don't understand your comment - but no, this is not homework. – Itamar Katz Jan 26 '12 at 07:12
  • Funny, [another question with the same underlying problem](http://stackoverflow.com/questions/9314534/why-does-changing-0-1f-to-0-slow-down-performance-by-10x) showed up a mere 3 weeks after this one. – Mysticial Aug 29 '12 at 04:58

3 Answers


That's probably because your "fast" data consists only of normal floating-point numbers, while your "slow" data contains lots of denormalized (subnormal) numbers.
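
You can check this directly on your data. Here is a minimal sketch (a hypothetical helper, not part of your program) that counts values which are nonzero but smaller in magnitude than FLT_MIN, i.e. denormalized:

#include <cfloat>
#include <cmath>
#include <cstdio>

// Count floats that are nonzero but smaller in magnitude than FLT_MIN,
// i.e. subnormal ("denormalized") values.
static int count_denormals(const float *p, int count)
{
    int denormals = 0;
    for (int i = 0; i < count; ++i) {
        if (p[i] != 0.0f && std::fabs(p[i]) < FLT_MIN)
            ++denormals;
    }
    return denormals;
}

// Usage with the arrays from the question, after reading the file:
//   printf("%d of %d values are denormal\n", count_denormals(arr2, M*N), M*N);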

As for your second question, you can try to improve speed with the following, which treats all denormalized numbers as exact zeros:

#include <xmmintrin.h>
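// 0x8040 = FTZ (flush-to-zero, bit 15) | DAZ (denormals-are-zero, bit 6) in the MXCSR register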
_mm_setcsr(_mm_getcsr() | 0x8040);
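
Equivalently, assuming your headers provide them, the same two MXCSR flags can be set with the named macros (the DAZ one lives in pmmintrin.h):

#include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE / _MM_FLUSH_ZERO_ON
#include <pmmintrin.h>   // _MM_SET_DENORMALS_ZERO_MODE / _MM_DENORMALS_ZERO_ON

_MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);           // flush denormal results to zero
_MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);   // treat denormal inputs as zero

Note that this changes the floating-point behaviour of all SSE code running in the thread, so only set it where losing denormals is acceptable.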
Evgeny Kluev

I can think of two reasons for this.

First, the branch predictor may be making incorrect decisions. This is one potential cause of performance gaps that appear when the data changes but the code does not. However, in this case it seems very unlikely.

The second possible reason is that your "mostly zeros" data doesn't really consist of zeros, but rather of almost-zeros, or that you're keeping arr1 in the almost-zero range. See the Wikipedia article on denormal numbers.
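
If you want to see the effect of almost-zeros in isolation, here is a rough, self-contained sketch (sizes and scale factor copied from the question, input values made up) that runs the same loop once on ordinary values and once on denormal ones:

#include <cstdio>
#include <ctime>
#include <vector>

// Same multiply-accumulate loop as in the question, timed with clock().
static double run(const std::vector<float> &in, std::vector<float> &out,
                  float scale, int niter)
{
    const int count = (int)in.size();
    std::clock_t t0 = std::clock();
    for (int iter = 0; iter < niter; ++iter)
        for (int n = 0; n < count; ++n)
            out[n] += scale * in[n];
    return double(std::clock() - t0) / CLOCKS_PER_SEC;
}

int main()
{
    const int N = 150 * 150;
    const int niter = 20000;
    const float scale = 6.6e-14f;

    std::vector<float> normal(N, 0.5f);    // well inside the normal range
    std::vector<float> tiny(N, 1.0e-39f);  // below FLT_MIN, i.e. denormal
    std::vector<float> out1(N, 0.0f), out2(N, 0.0f);

    std::printf("normal input:   %.2f s\n", run(normal, out1, scale, niter));
    std::printf("denormal input: %.2f s\n", run(tiny, out2, scale, niter));

    // Touch the results so the loops cannot be optimized away entirely.
    std::printf("check: %g %g\n", out1[0], out2[0]);
    return 0;
}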

Borealid

There is nothing strange about the data from I.bin taking longer to process: you have lots of numbers like '1.401e-045#DEN' or '2.214e-043#DEN', where #DEN means the number cannot be normalized to standard float precision. Given that you are going to multiply them by 6.6e-14, you'll definitely get underflow exceptions, which significantly slow down the calculations.
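
For reference, here is a tiny sketch (using one of the values quoted above) showing that such a #DEN number is below FLT_MIN, the smallest normal float, and is therefore stored as a denormal:

#include <cfloat>
#include <cstdio>
#include <cstring>

int main()
{
    float f = 1.401e-45f;   // one of the values quoted above (the #DEN suffix marks a denormal)
    unsigned int bits;
    std::memcpy(&bits, &f, sizeof(bits));   // inspect the raw IEEE-754 encoding

    // A nonzero float below FLT_MIN has an all-zero exponent field,
    // i.e. it is stored as a denormal.
    std::printf("value = %g, bits = 0x%08X, FLT_MIN = %g\n", f, bits, FLT_MIN);
    return 0;
}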

Arty