3

I'm trying to find the weighted median of a column based on multiple columns. Here is an example:

Date        |    Item    |   BetterPrice  |   TotalCost    |   Location
-------------------------------------------------------------------------
2022-03-01  |     AB     |       0        |      200       |      3
2022-03-01  |     AB     |       0        |      200       |      2
2022-03-01  |     AB     |       1        |      300       |      3
2022-03-01  |     AC     |       1        |      400       |      2
2022-04-01  |     AB     |       1        |      400       |      1
2022-04-01  |     AC     |       1        |      100       |      3
2022-04-01  |     AC     |       0        |      50        |      1

I want to find weighted median of the Location column and I want to use the TotalCost column as weights. I want to use aggregate TOO since I want to find the sum of the BetterPrice column as well.

Originally I tried to used the package wquantiles with the weighted.median function, and to do the above tasks I tried something similar to the following code:

import wquantiles
wm = lambda x: weighted.median(x , TotalCost)
df2 = df.groupby(['Date',  'Item']).agg({'BetterPrice': 'sum', "Location": wm}).reset_index()

Unfortunately, this doesn't seem to work, so I thought I should try to ask how best to accomplish this task, thank you!

Also, here is a dataframe to replicate my above problem (hopefully it is detailed enough)

import pandas as pd

data={'Date':['2022-03-01','2022-03-01','2022-03-01','2022-03-01', '2022-04-01', '2022-04-01', '2022-04-01'],'Item':['AB','AB','AB','AC', 'AB', 'AC', 'AC'],'BetterPrice':[0,0,1,1, 1, 1, 0],'TotalCost':[200,200,300,400, 400, 100, 50],'Location':[3,2,3,2,1,3, 1]}


df=pd.DataFrame(data)
  • 1
    Hi and welcome here. I believe [this answer](https://stackoverflow.com/a/33054358/9274732) would help you with the weighted average part. use `apply` instead of `agg`c after the `groupby`. Also why you use `size` on the column BetterPrice while you say you want the `sum` in the text? – Ben.T Jun 28 '22 at 02:29
  • 1
    Ah sorry, I have updated my code now with `sum`. Also, I will try your linked answer - I actually may have tried it a few days ago since it looks very familiar, but I will try again just in case – rightleftdownup313 Jun 28 '22 at 02:36

1 Answers1

1

Here is a way to do it. To get the weighted median, you can get the np.median of the np.repeat values in Location with TotalCost. Do this per group. Then concat the result with the sum on the other column per group as well.

gr = df.groupby(['Date',  'Item'])
res = pd.concat(
    [gr.apply(lambda x: np.median(np.repeat(x.Location, x.TotalCost))),
     gr['BetterPrice'].sum()],
    keys=['Loc_weigthed_median','BetterPrice_sum'],
    axis=1
).reset_index()
print(res)
#          Date Item  Loc_weighted_median  BetterPrice_sum
# 0  2022-03-01   AB                  3.0                1
# 1  2022-03-01   AC                  2.0                1
# 2  2022-04-01   AB                  1.0                1
# 3  2022-04-01   AC                  3.0                1
Ben.T
  • 29,160
  • 6
  • 32
  • 54
  • Thanks! I'm going to try your code out now. I was wondering though, how exactly does it work? For instance, why do you use repeat(...) in the median? – rightleftdownup313 Jun 28 '22 at 06:08
  • 1
    @rightleftdownup313 there is no implementation of weighted median in numpy and I don't have the `wquantiles` installed. To get the median, you need all the element of the population (while the weighted mean does not need to see each element of the population, the calculation is slightly different). Anyway, so for each group by, you repeat each element of the population with the weight given in the other column and get the median. – Ben.T Jun 28 '22 at 12:23