10

I'm looking for good study material about computation of weighted median algorithm and/or sample code in C++. The weights of my median are values between 0 and 1. Could you recommend me some links?

chikuba
  • 4,229
  • 6
  • 43
  • 75
Viper
  • 597
  • 1
  • 8
  • 24
  • do you have a pair of values `[x, y]` and like to take the weighted median where `y` is your weight? please elaborate your question a bit. – Filip Roséen - refp Mar 20 '12 at 20:44
  • I have tried to use Boost library implementation but I'd like to more understand this algorithm, because I need to design a special variant of this solution in my case. – Viper Mar 20 '12 at 20:46
  • I need to find the value, which minimizes so called the weighted classification error. So I have a pair of values [error, weights] where values are natural numbers but weights are fractions between 0 and 1. I have read that I can find the minimum value in linear time using weighted median algorithm... – Viper Mar 20 '12 at 20:50

2 Answers2

20

The weighted median is defined like so:

If x is a sorted array of N elements, and w is the array of weights with a total weight W, then the weighted median is the last x[i] such that the sum of w[i] and of all previous weights are less than or equal to S/2.

In C++, this can be expressed like so (assuming x, w and W are defined as above)

double sum = 0;
int i;
for(i = 0; i < N; ++i)
{
    sum += w[i];
    if(sum > W/2)
        break;
}
double median = x[i-1];

EDIT

So it seems I answered this question too hastily and made some mistakes. I found a neat description of weighted median from the R documentation, which describes it like so:

For the n elements x = c(x[1], x[2], ..., x[n]) with positive weights w = c(w[1], w[2], ..., w[n]) such that sum(w) = S, the weighted median is defined as the element x[k] for which initial the total weight of all elements x[i] < x[k] is less or equal to S/2 and for which the total weight of all elements x[i] > x[k] is less or equal to S/2.

From this description, we have a pretty straight-forward implementation of the algorithm. If we start with k == 0, then there are no elements before x[k], so the total weight of elements x[i] < x[k] will be less than S/2. Depending on the data, the total weight of the elements x[i] > x[k] may or may not be less than S/2. So we can just move forward through the array until this second sum becomes less than or equal to S/2:

#include <cstddef>
#include <numeric>
#include <iostream>

int main()
{
  std::size_t const N = 5;
  double x[N] = {0, 1, 2, 3, 4};
  double w[N] = {.1, .2, .3, .4, .5};

  double S = std::accumulate(w, w+N, 0.0); // the total weight

  int k = 0;
  double sum = S - w[0]; // sum is the total weight of all `x[i] > x[k]`

  while(sum > S/2)
  {
    ++k;
    sum -= w[k];
  }

  std::cout << x[k] << std::endl;
}

Note that if the median is the last element (medianIndex == N-1), then sum == 0, so the condition sum > S/2 fails. Thus, k will never be out of bounds (unless N == 0!). Also, if there are two elements that satisfy the condition, the algorithm always selects the first one.

Ken Wayne VanderLinde
  • 18,915
  • 3
  • 47
  • 72
  • 1
    cute. i guess strictly it's >=? or in the case of equality do you take the mean of that and the following? or am i being obsessive? ;o) – andrew cooke Mar 20 '12 at 20:55
  • @andrewcooke: My code was right, my description was slightly wrong. It's been fixed. And if you really want to be obsessive, there are many medians which can be used. Actually, any value in the range `[x[i-1], x[i])` is a median. – Ken Wayne VanderLinde Mar 20 '12 at 21:03
  • oh, i missed the `i-1`. so what happens if w[0] is 0.9? does `i` increment on break (otherwise you have `x[-1]`)? might be better to initialise sum to `w[0]` and start the loop from 1? no, that doesn't seem right either. sorry, may be confused. i know what you mean, anyway. – andrew cooke Mar 20 '12 at 21:10
  • @KenWayneVanderLine. Thank you. Are you sure that your algorithm is correct? I know that for n elements with all weights equals 1/n the weighted median should be "usual median". So for {1,2,3,4,5}, {1/5, 1/5,1/5,1/5,1/5} it should be "3". Your algorithm gives me "2". Similarly for even values of numbers {1,2,3,4,5,6} with the same weights it should be 3 and 4 (literally (3+4)/2), but your algorithm gives me a "2". So maybe not x[i-1] but x[i] instead?Please correct me if I did something wrong. – Viper Mar 20 '12 at 23:07
  • @Viper: Yeah, I messed it up, so the index will be out by one each time. I added to new, correct implementation. As for your point about averaging 3 and 4, this is only one option. You can actually use any value between 3 and 4 (inclusively) as the median - my code will always spit out 3, but this should be easy to customize. – Ken Wayne VanderLinde Mar 20 '12 at 23:22
2

Here is an implementation of the weighted median for initially unsorted vectors. It builds on the very good answer by @Ken Wayne VanderLinde for the median calculation, and on the index sorter given in this thread.

    template <typename VectorType>
    auto sort_indexes(VectorType const& v)
    {
        std::vector<int> idx(v.size());
        std::iota(std::begin(idx), std::end(idx), 0);

        std::sort(std::begin(idx), std::end(idx), [&v](int i1, int i2) {return v[i1] < v[i2];});

        return idx;
    }

    template<typename VectorType1, typename VectorType2>
    auto weightedMedian(VectorType1 const& x, VectorType2 const& weight)
    {
        double totalWeight = 0.0;
        for (int i = 0; i < static_cast<int>(x.size()); ++i)
        {
            totalWeight += weight[i];
        }

        auto ind = sort_indexes(x);

        int k = ind[0];
        double sum = totalWeight - weight[k];

        for (int i = 1; i < static_cast<int>(ind.size()); ++i)
        {
            k = ind[i];
            sum -= weight[k];

            if (sum <= 0.5 * totalWeight)
            {
                break;
            }
        }
        return x[k];
    }

It works with any vector type which supports operator[](int) and size() (--therefore no use is made of std::accumulate etc.).

Community
  • 1
  • 1
davidhigh
  • 14,652
  • 2
  • 44
  • 75
  • 1
    it won't work for the following case: elements=[1,2], weights=[100, 50]. meaning, it wont work if the result should be the first element. you can easily fix that by starting the loop from 0, init sum to totalWeight, just declare k outside the loop and verify the vector is not empty. – Jonathan Jun 08 '21 at 08:16
  • @Jonathan: thanks for the test and the suggested correction -- please feel free to edit it into the code. – davidhigh Jun 20 '21 at 11:30