
I can see with the CPU profiler that compute_variances() is the bottleneck of my project.

  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 75.63      5.43     5.43       40   135.75   135.75  compute_variances(unsigned int, std::vector<Point, std::allocator<Point> > const&, float*, float*, unsigned int*)
 19.08      6.80     1.37                             readDivisionSpace(Division_Euclidean_space&, char*)
 ...

Here is the body of the function:

void compute_variances(size_t t, const std::vector<Point>& points, float* avg,
                       float* var, size_t* split_dims) {
  for (size_t d = 0; d < points[0].dim(); d++) {
    avg[d] = 0.0;
    var[d] = 0.0;
  }
  float delta, n;
  for (size_t i = 0; i < points.size(); ++i) {
    n = 1.0 + i;
    for (size_t d = 0; d < points[0].dim(); ++d) {
      delta = (points[i][d]) - avg[d];
      avg[d] += delta / n;
      var[d] += delta * ((points[i][d]) - avg[d]);
    }
  }

  /* Find t dimensions with largest scaled variance. */
  kthLargest(var, points[0].dim(), t, split_dims);
}

where kthLargest() doesn't seem to be a problem, since I see that:

0.00 7.18 0.00 40 0.00 0.00 kthLargest(float*, int, int, unsigned int*)

compute_variances() takes a vector of vectors of floats (i.e. a vector of Points, where Point is a class I have implemented) and computes their variance in each dimension, following Knuth's online algorithm.

Here is how I call the function:

float avg[(*points)[0].dim()];
float var[(*points)[0].dim()];
size_t split_dims[t];

compute_variances(t, *points, avg, var, split_dims);

The question is: can I do better? I would be really happy to trade some accuracy in the variance computation for speed. Or maybe I could make the code more cache-friendly?

I compiled like this:

g++ main_noTime.cpp -std=c++0x -p -pg -O3 -o eg

Notice that before this edit I had used -o3, not with a capital 'O'. Thanks to ypnos, I now compile with the optimization flag -O3. I am sure there was a difference between the two, since I performed time measurements with one of the methods on my pseudo-site.

Note that compute_variances() is now dominating the overall project's time!

[EDIT]

compute_variances() is called 40 times.

Per group of 10 calls, the following hold:

points.size() = 1000   and points[0].dim() = 10000
points.size() = 10000  and points[0].dim() = 100
points.size() = 10000  and points[0].dim() = 10000
points.size() = 100000 and points[0].dim() = 100

Each call handles different data.

Q: How fast is access to points[i][d]?

A: points[i] is just the i-th element of the std::vector, and the second [] is implemented like this in the Point class:

const FT& operator [](const int i) const {
  if (i < (int) coords.size() && i >= 0)
     return coords.at(i);
  else {
     std::cout << "Error at Point::[]" << std::endl;
     exit(1);
  }
  return coords[0];  // Clear -Wall warning 
}

where coords is a std::vector of float values. This seems a bit heavy, but shouldn't the compiler be smart enough to predict that the branch is always taken (I mean, after the cold start)? Moreover, std::vector::at() is supposed to be constant time (as stated in the reference). I changed this to have only .at() in the body of the function and the time measurements remained pretty much the same.

The division in compute_variances() is for sure heavy! However, Knuth's algorithm is numerically stable, and I was not able to find another algorithm that is both numerically stable and free of division.

Note that I am not interested in parallelism right now.

[EDIT.2]

Minimal example of the Point class (I don't think I forgot to show anything):

class Point {
 public:

  typedef float FT;

  ...

  /**
   * Get dimension of point.
   */
  size_t dim() const {
    return coords.size();
  }

  /**
   * Operator that returns the coordinate at the given index.
   * @param i - index of the coordinate
   * @return the coordinate at index i
   */
  FT& operator [](const int i) {
    return coords.at(i);
    //it's the same if I have the commented code below
    /*if (i < (int) coords.size() && i >= 0)
      return coords.at(i);
    else {
      std::cout << "Error at Point::[]" << std::endl;
      exit(1);
    }
    return coords[0];  // Clear -Wall warning*/
  }

  /**
   * Operator that returns the coordinate at the given index. (constant)
   * @param i - index of the coordinate
   * @return the coordinate at index i
   */
  const FT& operator [](const int i) const {
        return coords.at(i);
    /*if (i < (int) coords.size() && i >= 0)
      return coords.at(i);
    else {
      std::cout << "Error at Point::[]" << std::endl;
      exit(1);
    }
    return coords[0];  // Clear -Wall warning*/
  }

 private:
  std::vector<FT> coords;
};
gsamaras
  • What are values - points.size and points[0].dim - for 32 seconds run? How fast is access to points[i][d]? – MBo May 29 '14 at 14:10
  • A couple of super-quick observations: if the dimensionality of your points is uniform per `compute_variances()` batch, templatizing the function on that value might let the compiler unroll the inner loops more efficiently. Also, as others said, the divide-per-iteration is probably dominating your time costs. You can't pre-sort your array for numerical stability only once because there are multiple dimensions, but you may want to consider other variance algorithms if this is your bottleneck. – Jeff May 29 '14 at 15:15
  • @MBo updated. Jeff, I am not sure what you mean by uniform dimensionality, but I think that the update will cover you too. – gsamaras May 29 '14 at 18:51
  • We really need to see the *entire* declaration of the `Point` class. For example, if `dim()` is not a const method, the inner loop will be quite inefficient. – Kuba hasn't forgotten Monica May 29 '14 at 20:48
  • @KubaOber, I was thinking if I should post the `Point` class. `dim()` is const. I will post the minimal `Point` class. – gsamaras May 29 '14 at 20:49
  • If you run on the machine you build on, add `-march=native -mtune=native` to your compile flags. – rhashimoto May 29 '14 at 22:05
  • @rhashimoto did that. Timing appears to be a bit slower. – gsamaras May 29 '14 at 22:11
  • Mean is just the average. Variance is just the average squared difference from the mean. You can estimate both of those by looking at a random subset of the points, not all of them, if you only need approximate values. Also, some unrolling could be in order, though your inner loop looks a little weird. – Mike Dunlavey Jun 23 '14 at 19:29
  • @MikeDunlavey thanks for your comment. You had it changed with the help of the other guys. As for the random subset I agree, but it heavily depends on the dataset. – gsamaras Jun 23 '14 at 19:37

7 Answers


1. SIMD

One easy speedup for this is to use vector instructions (SIMD) for the computation. On x86 that means SSE or AVX instructions. Depending on your word length and processor, you can get speedups of about 4x or even more. This code here:

for (size_t d = 0; d < points[0].dim(); ++d) {
    delta = (points[i][d]) - avg[d];
    avg[d] += delta / n;
    var[d] += delta * ((points[i][d]) - avg[d]);
}

can be sped up by doing the computation for four elements at once with SSE. As your code really only processes a single element in each loop iteration, there is no bottleneck preventing that. If you go down to 16-bit short instead of 32-bit float (an approximation, then), you can fit eight elements in one instruction. With AVX it would be even more, but you need a recent processor for that.
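For illustration, here is a minimal SSE sketch of the inner-loop update (intrinsics from xmmintrin.h). The function name, the raw-pointer parameter p for one point's contiguous coordinates, the assumption that the dimension is a multiple of 4, and the precomputed reciprocal inv_n are assumptions of the sketch, not part of the original code:

#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics

// Sketch only: p points at one point's coordinates, dim is a multiple of 4,
// inv_n is 1/n for the current point index.
static void welford_step_sse(const float* p, float* avg, float* var,
                             std::size_t dim, float inv_n) {
  const __m128 vinv_n = _mm_set1_ps(inv_n);
  for (std::size_t d = 0; d < dim; d += 4) {
    __m128 x     = _mm_loadu_ps(p + d);
    __m128 a     = _mm_loadu_ps(avg + d);
    __m128 v     = _mm_loadu_ps(var + d);
    __m128 delta = _mm_sub_ps(x, a);
    a = _mm_add_ps(a, _mm_mul_ps(delta, vinv_n));            // avg += delta / n
    v = _mm_add_ps(v, _mm_mul_ps(delta, _mm_sub_ps(x, a)));  // var += delta * (x - new avg)
    _mm_storeu_ps(avg + d, a);
    _mm_storeu_ps(var + d, v);
  }
}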

It is not the solution to your performance problem on its own, but one technique that can be combined with the others.

2. Micro-parallelism

The second easy speedup when you have that many loop iterations is to use parallel processing. I typically use Intel TBB; others might suggest OpenMP instead. For this you would probably have to change the loop order: parallelize over d in the outer loop, not over i.
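As a rough illustration of what parallelizing over d could look like, here is a sketch using OpenMP rather than TBB; it assumes the dimension has been hoisted into D, reuses points, avg and var from the question, and gives each thread a disjoint range of dimensions:

// Sketch: each dimension is independent, so threads never share avg[d]/var[d].
#pragma omp parallel for
for (int d = 0; d < D; ++d) {
  float a = 0.0f, v = 0.0f;
  for (size_t i = 0; i < points.size(); ++i) {
    const float delta = points[i][d] - a;
    a += delta / (1.0f + i);
    v += delta * (points[i][d] - a);
  }
  avg[d] = a;
  var[d] = v;
}

Note that this traverses the data column by column, which interacts badly with the current vector-of-vectors layout, so it pairs best with a cache-friendlier layout (see the other answers).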

You can combine both techniques, and if you do it right, on a quad-core with HT you might get a speed-up of 25-30x for the combination without any loss in accuracy.

3. Compiler optimization

First of all, maybe it is just a typo here on SO, but it needs to be -O3, not -o3! As a general note, it might be easier for the compiler to optimize your code if you declare the variables delta and n within the scope where you actually use them. You should also try the -funroll-loops compiler option as well as -march. The argument to the latter depends on your CPU; nowadays -march=core2 is typically fine (also for recent AMDs) and includes SSE optimizations (but I would not trust the compiler just yet to do that for your loop).
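For example, the compile line from the question could become something like the following (whether core2 is the right -march value for your machine is an assumption):

g++ main_noTime.cpp -std=c++0x -O3 -funroll-loops -march=core2 -p -pg -o eg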

ypnos
  • Hi ypnos (==ύπνος?), sorry but I forgot to mention that I am not interested in parallelism(I think it's with an 's', not with a 'z') right now. I have used the first one, one year ago in C, when performing dot products. However, I clearly remember that I failed to transfer my code in C++. Now, I do not even remember how to do it even in C. I see about the 3rd section of yours this: http://stackoverflow.com/questions/10021536/increasing-optimization-level-g However, I did not have any errors, which means that either the compiler ignored '-o3', or it interpreted it like something else. I'll edit – gsamaras May 29 '14 at 19:10
  • "As a general note, it might be easier for the compiler to optimize your code if you declare the variables delta, n within the scope where you actually use them." This is something that bothered me when writing the code. Is what you say a fact? Because, I am just thinking that the destruction, creation of the variables will cost some time. Of course, I could write some code and test, but now I have to test all the nice ideas in this page! – gsamaras May 29 '14 at 19:15
  • It wasn't a typo! Must update the whole question. +1 – gsamaras May 29 '14 at 19:22
  • Creating a primitive variable will not end up in an extra CPU instruction. Variables with limited scope are easier to optimize on, e.g. make it a CPU register instead of/without putting it on the stack. In this simple example I expect the compiler to realize that the variable is only used locally but just as a general rule. It is also easier for the programmer to read the code. Still have a look at SIMD instructions, it's really worth it. And yes, my nick is ύπνος, but I'm not greek (I can tell you are from your last name ;). – ypnos May 30 '14 at 10:59

The big problem with your data structure is that it's essentially a vector<vector<float> >. That's a pointer to an array of pointers to arrays of float with some bells and whistles attached. In particular, accessing consecutive Points in the vector doesn't correspond to accessing consecutive memory locations. I bet you see tons and tons of cache misses when you profile this code.

Fix this before horsing around with anything else.
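A minimal sketch of what a flattened layout could look like (the struct name and accessors are illustrative, not part of your code): all coordinates live in one contiguous buffer, and point i's d-th coordinate sits at index i * dim + d.

#include <cstddef>
#include <vector>

// Sketch: one flat buffer instead of a vector of per-point vectors.
struct FlatPoints {
  std::size_t dim;
  std::vector<float> data;  // size() == num_points * dim

  float at(std::size_t i, std::size_t d) const { return data[i * dim + d]; }
  const float* row(std::size_t i) const { return &data[i * dim]; }
  std::size_t size() const { return dim ? data.size() / dim : 0; }
};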

Lower-order concerns include the floating-point division in the inner loop (compute 1/n in the outer loop instead) and the big load-store chain that is your inner loop. You can compute the means and variances of slices of your array using SIMD and combine them at the end, for instance.
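For the "combine them at the end" part, here is a sketch of the usual merge step for two disjoint slices (mean plus sum of squared deviations, as in your var array); the function name and the double accumulators are assumptions of the sketch:

#include <cstddef>

// Merge the stats of slice b (avg_b, m2_b, nb elements) into slice a.
// Assumes na + nb > 0.
static void merge_stats(double& avg_a, double& m2_a, std::size_t na,
                        double avg_b, double m2_b, std::size_t nb) {
  const double delta = avg_b - avg_a;
  const std::size_t n = na + nb;
  avg_a += delta * nb / n;
  m2_a  += m2_b + delta * delta * (double)na * nb / n;
}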

The bounds-checking once per access probably doesn't help, either. Get rid of that too, or at least hoist it out of the inner loop; don't assume the compiler knows how to fix that on its own.

tmyklebu
  • How do I fix the problem you are mentioning with my data structure? How can I move the division to the outer loop (I can't see how this can be achieved, since `delta` is computed in the inner loop)? I am going to test how much faster it is without the bounds-checking. See my edit. – gsamaras May 29 '14 at 20:23
  • @G.Samaras "can't see how this can be achieved" Multiply instead of dividing. It's after all a constant in the inner loop. – Kuba hasn't forgotten Monica May 29 '14 at 21:09
  • Kuba I see now! +1 to tmyklebu for being the first proposing how not to use division. – gsamaras May 29 '14 at 21:33
  • @G.Samaras: `vector::at()` does bounds-checking of its own, I believe. To fix your data structure, throw away the `Point` structure and use a single flat `vector` to store everything, where the first `Point` is at position `0`, the second is at position `dim()`, and so forth. – tmyklebu May 30 '14 at 02:04
  • You mean to throw away the `vector` I guess. However, this class provides many operators, constructors, etc. . – gsamaras May 30 '14 at 09:31
  • @G.Samaras: Yes, but the layout in memory of a vector of `Point`s is problematic. – tmyklebu May 30 '14 at 14:12

Here's what I would do, in guesstimated order of importance:

  1. Return the floating-point from the Point::operator[] by value, not by reference.
  2. Use coords[i] instead of coords.at(i), since you already assert that it's within bounds. The at member checks the bounds. You only need to check it once.
  3. Replace the home-baked error indication/checking in the Point::operator[] with an assert. That's what asserts are for. They are nominally no-ops in release mode - I doubt that you need to check it in release code.
  4. Replace the repeated division with a single division and repeated multiplication.
  5. Remove the need for wasted initialization by unrolling the first two iterations of the outer loop.
  6. To lessen the impact of cache misses, run the inner loop alternately forwards and then backwards. This at least gives you a chance of reusing some cached avg and var. It may in fact remove all cache misses on avg and var if prefetching works on a reverse order of iteration, as it well should.
  7. On modern C++ compilers, std::fill and std::copy can leverage type alignment and have a chance at being faster than the C library memset and memcpy.

The Point::operator[] will have a chance of getting inlined in the release build and can reduce to two machine instructions (effective address computation and floating point load). That's what you want. Of course it must be defined in the header file, otherwise the inlining will only be performed if you enable link-time code generation (a.k.a. LTO).

Note that the Point::operator[] body below is only equivalent to the single-line return coords.at(i) in a debug build. In a release build the entire body is equivalent to return coords[i], not return coords.at(i).

#include <algorithm>  // std::copy, std::copy_n, std::fill_n
#include <cassert>

Point::FT Point::operator[](int i) const {
  assert(i >= 0 && i < (int)coords.size());
  return coords[i];
}

const Point::FT * Point::constData() const {
  return &coords[0];
}

void compute_variances(size_t t, const std::vector<Point>& points, float* avg,
                       float* var, size_t* split_dims)
{
  assert(points.size() > 0);
  const int D = points[0].dim();

  // i = 0, i_n = 1
  assert(D > 0);
#if __cplusplus >= 201103L
  std::copy_n(points[0].constData(), D, avg);
#else
  std::copy(points[0].constData(), points[0].constData() + D, avg);
#endif

  // i = 1, i_n = 0.5
  if (points.size() >= 2) {
    assert(points[1].dim() == D);
    for (int d = D - 1; d >= 0; --d) {
      float const delta = points[1][d] - avg[d];
      avg[d] += delta * 0.5f;
      var[d] = delta * (points[1][d] - avg[d]);
    }
  } else {
    std::fill_n(var, D, 0.0f);
  }

  // i = 2, ...
  for (size_t i = 2; i < points.size(); ) {
    {
      const float i_n = 1.0f / (1.0f + i);
      assert(points[i].dim() == D);
      for (int d = 0; d < D; ++d) {
        float const delta = points[i][d] - avg[d];
        avg[d] += delta * i_n;
        var[d] += delta * (points[i][d] - avg[d]);
      }
    }
    ++ i;
    if (i >= points.size()) break;
    {
      const float i_n = 1.0f / (1.0f + i);
      assert(points[i].dim() == D);      
      for (int d = D - 1; d >= 0; --d) {
        float const delta = points[i][d] - avg[d];
        avg[d] += delta * i_n;
        var[d] += delta * (points[i][d] - avg[d]);
      }
    }
    ++ i;
  }

  /* Find t dimensions with largest scaled variance. */
  kthLargest(var, D, t, split_dims);
}
Kuba hasn't forgotten Monica
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/54750/discussion-between-g-samaras-and-kuba-ober). – gsamaras May 29 '14 at 22:03
  • Please, next time you edit your post after the action has passed, notify the OP somehow, since I wouldn't have seen the `std::fill` vs `memcpy` for example if I didn't come back here. :) – gsamaras Jun 06 '14 at 13:22
  • @G.Samaras "after the action has passed" Huh? What action? Passed how? – Kuba hasn't forgotten Monica Jun 06 '14 at 17:14
  • I mean that after the activity on my question had passed (after nobody was commenting on or answering the question any more, and after I had accepted your answer), your edit would not have been visible to me unless I came back to this question to check something said by another user. – gsamaras Jun 06 '14 at 19:07
  • I'm sure you can request edit notification on answers to your questions as a feature. – Kuba hasn't forgotten Monica Jun 06 '14 at 20:31
for (size_t d = 0; d < points[0].dim(); d++) {
  avg[d] = 0.0;
  var[d] = 0.0;
}

This code could be optimized by simply using memset. The IEEE 754 representation of 0.0 in 32 bits is 0x00000000. If the dimension is big, it is worth it. Something like:

memset((void*)avg, 0, points[0].dim() * sizeof(float));

In your code you have a lot of calls to points[0].dim(). It would be better to call it once at the beginning of the function and store the result in a variable. The compiler likely already does this (since you are using -O3).
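In other words, something along these lines at the top of the function (a trivial sketch of the hoist):

const size_t D = points[0].dim();  // call dim() once, reuse everywhere
for (size_t d = 0; d < D; d++) {
  avg[d] = 0.0f;
  var[d] = 0.0f;
}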

The division operations are a lot more expensive (from clock-cycle POV) than other operations (addition, subtraction).

avg[d] += delta / n;

It could make sense to try to reduce the number of divisions: use a partial, non-cumulative average calculation, which would result in Dim division operations per block of N elements (instead of N x Dim), with N < points.size().
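A sketch of that idea for a single dimension (illustrative names, and it assumes the element count is a multiple of the block size B): one division per block plus one final division, instead of one division per element as in the online update.

#include <cstddef>

// Sketch: blocked mean with one division per block of B elements.
static float blocked_mean(const float* x, std::size_t n, std::size_t B) {
  float mean_of_means = 0.0f;
  const std::size_t num_blocks = n / B;          // assumes n % B == 0
  for (std::size_t b = 0; b < num_blocks; ++b) {
    float block_sum = 0.0f;
    for (std::size_t i = b * B; i < (b + 1) * B; ++i)
      block_sum += x[i];
    mean_of_means += block_sum / (float)B;       // one division per block
  }
  return mean_of_means / (float)num_blocks;      // one final division
}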

A huge speedup could be achieved using CUDA or OpenCL, since the calculation of avg and var could be done simultaneously for each dimension (consider using a GPU).

Adam Bartha
  • Nice idea with the `memset`, since these are traditional 1D arrays. I am not sure what you mean by partial non-cumulative average. Can you provide an example? I remind you that I am interested in trade-offs between speed and accuracy. +1 for `memset`, but I am eager for the example I mentioned. :) – gsamaras May 29 '14 at 18:53
  • It turns out that `memset` was not that helpful, since I used the time methods here http://gsamaras.wordpress.com/code/1651-2/, but did not see any difference. I am not removing my +1, since you got a -1 from another user. – gsamaras May 29 '14 at 20:08
  • In any case, recommending `memset` in a C++ project seems odd, when you've got `std::fill` and `std::fill_n` - those can be, in fact, better performing than `memset`, especially for small numbers of elements. The myth that C is always faster than C++ is just that :) – Kuba hasn't forgotten Monica May 30 '14 at 01:19
  • Kuba Ober; the input for that method is a pointer, not a vector; We have a lot of elements here. – Adam Bartha May 30 '14 at 07:24
  • G. Samaras, with partial non-cumulative avg I was pointing to a solution with fewer divs. Ex N=3, 1 dim: for the first three iterations, the algorithm obtains the sum of these elements, then computes an average (1 div for N elements) and stores this partial average; then it processes the next N elements and so on. At the end, using 1 more div, it can obtain the avg of all values from the partial avgs. If N=points.size() you obtain the classical avg calculation. This works only if the upper limit of the values is considerably smaller than the max value of the data-type used; otherwise it may overflow. – Adam Bartha May 30 '14 at 07:52

Another optimization is cache optimization, including both the data cache and the instruction cache. (See, for instance, references on high-level optimization techniques and data-cache optimizations.)

Example of data cache optimization & unrolling:

for (size_t d = 0; d < points[0].dim(); d += 4)
{
  // Perform loading all at once.
  register const float p1 = points[i][d + 0];
  register const float p2 = points[i][d + 1];
  register const float p3 = points[i][d + 2];
  register const float p4 = points[i][d + 3];

  register const float delta1 = p1 - avg[d+0];
  register const float delta2 = p2 - avg[d+1];
  register const float delta3 = p3 - avg[d+2];
  register const float delta4 = p4 - avg[d+3];

  // Perform calculations
  avg[d + 0] += delta1 / n;
  var[d + 0] += delta1 * ((p1) - avg[d + 0]);

  avg[d + 1] += delta2 / n;
  var[d + 1] += delta2 * ((p2) - avg[d + 1]);

  avg[d + 2] += delta3 / n;
  var[d + 2] += delta3 * ((p3) - avg[d + 2]);

  avg[d + 3] += delta4 / n;
  var[d + 3] += delta4 * ((p4) - avg[d + 3]);
}

This differs from classic loop unrolling in that loading from the matrix is performed as a group at the top of the loop.

Edit 1:
A subtle data optimization is to place avg and var into a structure. This will ensure that the two arrays are next to each other in memory, without padding. The data-fetching mechanism in processors likes data that are very close to each other: there is less chance of a data cache miss and a better chance of loading all of the data into the cache.
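One way to realize this when the dimension is only known at run time is to interleave the two values per dimension (a sketch, and a slight variation of the idea; the names are illustrative):

#include <cstddef>
#include <vector>

struct DimStats { float avg; float var; };  // avg and var of one dimension sit together

// Sketch of the inner-loop update with the interleaved layout; p points at one
// point's contiguous coordinates, inv_n is 1/n for the current point index.
static void update(std::vector<DimStats>& stats, const float* p, float inv_n) {
  for (std::size_t d = 0; d < stats.size(); ++d) {
    const float delta = p[d] - stats[d].avg;
    stats[d].avg += delta * inv_n;
    stats[d].var += delta * (p[d] - stats[d].avg);
  }
}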

Thomas Matthews
  • I am looking for exploiting caching (data and instruction). However, don't you believe that the compiler will already do this? Also, do you think that the `register` keyword is needed? I have never used it. http://stackoverflow.com/questions/3207018/register-keyword-in-c/3208184#3208184 – gsamaras May 29 '14 at 18:58
  • I like using the `register` keyword for both reminding the compiler and readability. One of the compilers I'm using will not apply the optimization unless set high; which means either the entire source file or nothing. – Thomas Matthews May 29 '14 at 19:35
  • I don't believe the compiler will optimize automatically for cache loading and unloading. Caches are usually a platform specific issue and not all platforms have the same amount of cache. The *truth is in the assembly language*, so print the assembly language for the function with and without optimizations. Compare and see. Your Mileage May Vary. – Thomas Matthews May 29 '14 at 19:37
  • I had learned assembly, but I haven't touched it for a long time, so I guess I am going to compare times (with the methods I mentioned in my pseudo-site in my question). – gsamaras May 29 '14 at 19:40
  • I like the idea with the struct, but how am I going to set the sizes of the statically allocated arrays, since I know the sizes in run-time? I am used to these structs http://gsamaras.wordpress.com/code/structs-c/, but I wouldn't have a problem learning something new! – gsamaras May 29 '14 at 20:18
  • The `register` keyword is [deprecated in C++11](http://en.cppreference.com/w/cpp/language/storage_duration). Example code should probably not use it. – rhashimoto May 29 '14 at 20:36
  • @rhashimoto thanks! Maybe you know something about the struct idea Thomas said and how to use it with my arrays? – gsamaras May 29 '14 at 20:44
  • I un-rolled the loop as suggested in this answer (with or without the `register` keyword) and the timing was the same. – gsamaras May 29 '14 at 20:48
  • I am still interested in the `struct`. If we can test this too, then this question can get its +1. – gsamaras May 29 '14 at 22:23
  • See my **Edit 2**. A simple re-arrangement of the data. :-) – Thomas Matthews May 29 '14 at 23:13

You could use Fixed Point math instead of floating point math as an optimization.

Optimization via Fixed Point
Processors love to manipulate integers (signed or unsigned). Floating point may take extra computing power due to extracting the parts, performing the math, then reassembling the parts. One mitigation is to use Fixed Point math.

Simple Example: meters
Given the unit of meters, one could express lengths smaller than a meter by using floating point, such as 3.14159 m. However, the same length can be expressed in a unit of finer detail like millimeters, e.g. 3141.59 mm. For finer resolution, a smaller unit is chosen and the value multiplied, e.g. 3,141,590 um (micrometers). The point is choosing a small enough unit to represent the floating point accuracy as an integer.

The floating point value is converted at input into Fixed Point. All data processing occurs in Fixed Point. The Fixed Point value is converted to Floating Point before outputting.

Power of 2 Fixed Point Base
As with converting from floating point meters to fixed point millimeters, using 1000, one could use a power of 2 instead of 1000. Selecting a power of 2 allows the processor to use bit shifting instead of multiplication or division. Bit shifting by a power of 2 is usually faster than multiplication or division.

Keeping with the theme and accuracy of millimeters, we could use 1024 as the base instead of 1000. Similarly, for higher accuracy, use 65536 or 131072.
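A minimal sketch of a Q16.16 representation (16 integer bits, 16 fractional bits, base 65536), just to make the conversions and the shift-based scaling concrete; the names are illustrative and it assumes arithmetic right shifts for signed values:

#include <cstdint>

typedef int32_t fixed16;                    // Q16.16: stored value = real value * 65536

inline fixed16 to_fixed(float x)   { return (fixed16)(x * 65536.0f); }
inline float   to_float(fixed16 x) { return x / 65536.0f; }

// Multiply in a wider type, then shift the base back out.
inline fixed16 fx_mul(fixed16 a, fixed16 b) {
  return (fixed16)(((int64_t)a * b) >> 16);
}

// Dividing by a power of two becomes a shift.
inline fixed16 fx_div_pow2(fixed16 a, int k) { return a >> k; }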

Summary
Changing the design or implementation to use Fixed Point math allows the processor to use more integral data processing instructions than floating point. Floating point operations consume more processing power than integral operations in all but specialized processors. Using powers of 2 as the base (or denominator) allows code to use bit shifting instead of multiplication or division. Division and multiplication take more operations than shifting, and thus shifting is faster. So rather than optimizing code for execution (such as loop unrolling), one could try using Fixed Point notation rather than floating point.

Thomas Matthews
  • Whether `float`s, `int`s, etc. are going to be used is determined by the user. So, let's say that (s)he gives us `float`s. We could convert them into `int`s, but this would cancel the benefits of the processor. – gsamaras May 29 '14 at 20:34

Point 1. You're computing the average and the variance at the same time. Is that right? Don't you have to calculate the average first, then once you know it, calculate the sum of squared differences from the average? In addition to being right, it's more likely to help performance than hurt it. Trying to do two things in one loop is not necessarily faster than two consecutive simple loops.
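For reference, the two-pass version suggested in Point 1 would look something like this sketch (population variance of one dimension; the function name is illustrative):

// Sketch: mean first, then the averaged squared differences.
double variance_two_pass(const double* x, int n) {
  double sum = 0;
  for (int i = 0; i < n; i++)
    sum += x[i];
  const double avg = sum / n;

  double ssd = 0;
  for (int i = 0; i < n; i++)
    ssd += (x[i] - avg) * (x[i] - avg);
  return ssd / n;
}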

Point 2. Are you aware that there is a way to calculate average and variance at the same time, like this:

double sumsq = 0, sum = 0;
for (i = 0; i < n; i++){
  double xi = x[i];
  sum += xi;
  sumsq += xi * xi;
}
double avg = sum / n;
double avgsq = sumsq / n;
double variance = avgsq - avg*avg;

Point 3. The inner loops are doing repetitive indexing. The compiler might be able to optimize that to something minimal, but I wouldn't bet my socks on it.

Point 4. You're using gprof or something like it. The only reasonably reliable number to come out of it is self-time by function. It won't tell you very well how time is spent inside the function. I and many others rely on this method, which takes you straight to the heart of what takes time.

Mike Dunlavey