3

It is best to give an example.

Let's say vector A consists of:

A = {3  ,2 ,1 ,4  ,6 ,3 ,8 ,4}

and vector B consists of:

B = {1.5,2 ,2 ,1.5,3 ,3 ,3 ,2}

The unique values in vector B are {1.5, 2, 3}

I want the resulting vector RESULT to be:

RESULT[0] = Average(A given B=1.5) = Average(3,4)

RESULT[1] = Average(A given B=2 )  = Average(2,1,4)

RESULT[2] = Average(A given B=3 )  = Average(6,3,8)

What is the most efficient way of calculating this. My own method is to loop over unique elements of B, and for each of them, loop over each B value trying to match that unique number and keep summing up the corresponding element of vector A in each match, also counting the number of matches so I can find the average.

This is too slow. since My vector A is 8M elements, and vector B consists of 0.5M unique values.

Any help would be appreciated.

Paul Roub
  • 36,322
  • 27
  • 84
  • 93
user1780424
  • 403
  • 1
  • 6
  • 12

2 Answers2

3

Here's a lazy idea: Traverse both vectors in lockstep and aggregate the results in a separate container. For example:

#include <cassert>
#include <cmath>
#include <iostream>
#include <map>
#include <utility>

std::map<double, std::pair<int, std::size_t>> m;

assert(A.size() == B.size());

for (std::size_t i = 0; i != A.size(); ++i)
{
    assert(!std::isnan(B[i]));

    auto & p = m[B[i]];
    p.first += A[i];
    p.second += 1;
}

In the end you just report the results:

for (const auto & p : m)
    std::cout << "Average for bin " << p.first << " is "
              << static_cast<double>(p.second.first) / p.second.second
              << "\n";

(Beware that your key value must not be NaN: in an ordered map, NaN is not part of the strict ordering; in an unordered map, it does not compare equal to itself.)

Kerrek SB
  • 464,522
  • 92
  • 875
  • 1,084
  • Why use a `map` instead of an `unordered_map`? – D Drmmr Jun 25 '14 at 22:19
  • @DDrmmr: Ordered traversal for reporting? – Kerrek SB Jun 25 '14 at 22:19
  • Nice touch to assert !isnan :) I remember to SO question (+1) – sehe Jun 25 '14 at 22:25
  • Thanks, this method looks good. I will try this as well. I didn't know "auto & p = m[B[i]];" would automatically take care of the map. Nice! and no, my data do not have any NaNs. – user1780424 Jun 25 '14 at 22:28
  • @sehe: I seem to remember having seen [a question about this](http://stackoverflow.com/q/8096817/596781) :-) – Kerrek SB Jun 25 '14 at 22:35
  • Of course, using a proper struct instead of the std::pair does make my answer a bit more expressive `average_state[B[i]].sample(A[i]);` FTW :) – sehe Jun 25 '14 at 22:37
  • 2
    @sehe: I'm pretty surprised you managed to answer this without any reference to Boost Phoenix, Qi, Karma, or MPL sequences. Not even any Fusion... – Kerrek SB Jun 25 '14 at 22:38
  • @KerrekSB Hey! Just the fact that I use tag [tag:boost] to make a dent in learning the boost libraries, doesn't mean I can't do things without it (it just means I have to focus - there's not enough time to more than just these tags :)) – sehe Jun 25 '14 at 22:40
  • @sehe: Hm, well, for once I was actually hoping you could whip out some magic zip iterator that allows the simultaneous sorting that I suggested in the comment above. It might be nice to also have a solution that requires no extra space... – Kerrek SB Jun 25 '14 at 22:50
  • @KerrekSB I pondered about abusing `std::transform` (in binary form). But I thought it abuse. Also, you could look at Boost Accumulators if you're looking for some extra fidelities (and compile time) :) – sehe Jun 25 '14 at 22:56
1

You can do a loop with a (hash) table: see it Live On Coliru

int main()
{
    vector<int>    A = {3  ,2 ,1 ,4  ,6 ,3 ,8 ,4};
    vector<double> B = {1.5,2 ,2 ,1.5,3 ,3 ,3 ,2};

    assert(A.size() == B.size());

    struct accum { 
        uintmax_t sum = 0; 
        size_t number_of_samples = 0; 
        void sample(int val) { sum += val; ++number_of_samples; }
    };
    map<double, accum> average_state;

    for(size_t i = 0; i<B.size(); ++i)
        average_state[B[i]].sample(A[i]);

    for(auto& entry : average_state)
    {
        accum& state = entry.second;
        double average = static_cast<double>(state.sum) / state.number_of_samples;
        std::cout << "Bucket " << entry.first << "\taverage of " << state.number_of_samples << " samples:\t" << average << "\n";
    }
}

Prints

Bucket 1.5  average of 2 samples:   3.5
Bucket 2    average of 3 samples:   2.33333
Bucket 3    average of 3 samples:   5.66667
sehe
  • 374,641
  • 47
  • 450
  • 633
  • why is the average called effective? – 4pie0 Jun 25 '14 at 22:27
  • because it's calculated. You can name it "average" of course. I have no idea about the problem domain so I made something up. **updated** – sehe Jun 25 '14 at 22:27