3

I was wondering if there's a neater (or better yet, more efficient), method of summing values of a vector/(asymmetric) matrix (a matrix having structure like symmetry, could of course be exploited in looping, but not that pertinent to my question) pointed by a collection of indices. Basically this code could be used to calculate, say, a cost of a route through a 2D matrix. I'm looking for a way to utilize CPU, not GPU.

Here's some relevant code, the one I'm more interested is the first case. I was thinking it's possible to use std::accumulate with a lambda to capture the indices vector, but then I got wondering, if there's already a neater way, perhaps with some other operator. Not a "real problem" as looping is quite clear for my tastes too, but in hunt for the super-neat or more efficient on-liner...

template<typename out_type>
out_type sum(std::vector<float> const& matrix, std::vector<int> const& indices)
{
    out_type cost = 0;
    for(decltype(indices.size()) i = 0; i < indices.size() - 1; ++i) 
    {
        const int index = indices.size() * indices[i] + indices[i + 1];
        cost += matrix[index];
    }

    const int index = indices.size() * indices[indices.size() - 1] + indices[0];
    cost += matrix[index];

    return cost;
}

template<typename out_type>
out_type sum(std::vector<std::vector<float>> const& matrix, std::vector<int> const& indices)
{
    out_type cost = 0;
    for(decltype(indices.size()) i = 0; i < indices.size() - 1; i++) 
    {
        cost += matrix[indices[i]][indices[i + 1]];
    }
    cost += matrix[indices[indices.size() - 1]][indices[0]];

    return cost;
}

Oh, and PPL/TBB are fair game too.

Edit

As an afterthought and as commented to John, would there be a place to employ std::common_type in the calculation as the input and output types may differ? This is a bit of hand-waving and more like learning techniques and libraries. A form of code kata, if you will.

Edit 2

Now, there's one option to make the loops faster, explained in blog writing How to process a STL vector using SSE code by a blogger theowl84. The code uses __m128 directly, but I wonder if there's something in DirectXMath library too.

Edit 3

Now, after writing some concrete code, I found std::accumulate wouldn't get me far. Or at least I couldn't find a way to do the [indices[i + 1] part in matrix[indices[i]][indices[i + 1]]; in a neat way, as std::accumulate itself gives access to only the current value and the sum. In that light, it looks like novelocrat's approach would be the most fruitful one.

DeadMG proposed using parallel_reduce with associativity caveats, further commented by novelocrat. I didn't go about seeing if I could use parallel_reduce, as the interface looked somewhat cumbersome for quick trying. Other than that, even though my code executes serially, it would suffer from the same floating some issues as the parallel reduction version. Though the parallel version would/could be (much) more unpredictable with than serial version, I think.

This goes somewhat tangential, but it may be of interest to some stumbling here, and to those of whom have read this far, may be (very) interested on article Wandering Precision in The NAG blog, which details some intricanciens even introduced by hardware instruction re-ordering! Then there are some ruminations about this very issue in distributed setting in #AltDevBlogADay Synchronous RTS Engines and a Tale of Desyncs. Also, ACCU (the general mailing list is excellent, by the way, and it's free to join) features several articles (e.g. this) on floating point accuracy. A tangential to tangential, I found Fernando Cacciola's Robustness issues in geometric computing to be a good article to read, originally from ACCU mailing list.

And then then the std::common_type. I couldn't find usage for that. If I had two different types as parameters, then the return value could/should be decided by std::common_type. Perhaps more pertinent is std::is_convertible with static_assert to make sure the desired result type is convertible from the argument types (with a clean error message). Other than that, I can only make up a check that the return value/intermediate calculation value accurracy is sufficient to represent the result of summation without overflows and things like that, but I haven't come across a standard facility for that.

That about that, I think, ladies and gentlemen. I enjoyed myself, I hope those reading this got something out of this too.

Community
  • 1
  • 1
Veksi
  • 3,556
  • 3
  • 30
  • 69
  • 1
    All the references to `route` should be `indices`, right? – John Calsbeek Sep 08 '12 at 21:06
  • John, that is right. I got a bit entangled as I was writing the code with a specific use case in mind (not yet implemented. In fact, a thought occurred to me that should I also try to fit in std::common_type, just as to make this line of exercise more complete. – Veksi Sep 08 '12 at 21:10
  • A note here that I needed to prioritize a bit, but I'll be back (the latest and probably) the day after tomorrow. So as to everybody now I haven't quit on this. :-) – Veksi Sep 09 '12 at 21:43
  • If you're looking for a portable way to vectorize your code, have a look at [SIMDIA](https://charm.cs.uiuc.edu/cgi-bin/gitweb2.cgi?p=charm.git;a=blob;f=src/util/simd.h;h=6a9ca798b16d871de6d3a08c397ff3f31b9e83f9;hb=HEAD) – Phil Miller Sep 10 '12 at 02:11
  • @EDIT2 In general, by all means, don't **ever** do it like written in this blog, which is certainly **not** an option to make any loops faster in any way. Those unaligned loads and stores will kill every minor performance gained by the 2-vectorization of this cheap arithmetic instruction and might even make this whole thing slower than without SSE. I generally don't think SSE will help in your case due to the quite scattered memory accesses. – Christian Rau Sep 10 '12 at 12:38
  • You are right in that, Christian. I phrased it badly, I was thinking more in line of processing the indices vector more in parallel. That is, making multiple scattered reads at once and then summing the values. Though, I've reached a conclusion this code is "good enough", for me. :) – Veksi Sep 11 '12 at 06:58

2 Answers2

1

You could produce an iterator that takes matrix and indices and yields the appropriate values.

class route_iterator
{
  vector<vector<float>> const& matrix;
  vector<int> const& indices;
  int i;

public:
  route_iterator(vector<vector<float>> const& matrix_, vector<int> const& indices_,
                 int begin = 0)
  : matrix(matrix_), indices(indices_), i(begin)
  { }
  float operator*() {
    return matrix[indices[i]][indices[(i + 1) % indices.size()]];
  }
  route_iterator& operator++() {
    ++i;
    return *this;
  }
};

Then your accumulate runs from route_iterator(matrix, indices) to route_iterator(matrix, indices, indices.size()).

Admittedly, though, this sequentializes without a smart compiler turning it into something parallel. What you really want are parallel map and fold (accumulate) operations.

Phil Miller
  • 36,389
  • 13
  • 67
  • 90
  • You are right about the map and fold operators. Just nick in time I edited my question to make more explicit that I'm not thinking of doing anything in GPU, as that would be somewhat obvious. I need to sleep now (well past midning here), but I'll get back to this some point tomorrow/today. – Veksi Sep 08 '12 at 21:35
0
out_type cost = 0;
for(decltype(indices.size()) i = 0; i < indices.size() - 1; i++) 
{
    cost += matrix[indices[i]][indices[i + 1]];
}

This is basically std::accumulate. PPL provides (and so does TBB, if I recall) parallel_reduce. This requires associativity but not commutivity, and + over the real/float/integer is associative.

Christian Rau
  • 45,360
  • 10
  • 108
  • 185
Puppy
  • 144,682
  • 38
  • 256
  • 465
  • 1
    To be really pedantic, `+` over `float` is not actually associative. Consider `1 + 1 + epsilon + epsilon`: adding from the left gives you `2`, adding from the right will give you `2*(1 + epsilon)`. – Phil Miller Sep 10 '12 at 02:10