
I have a data.table in which I want to find the weighted mean of the performance column by month.

  dat <- structure(list(year = c(2014, 2015, 2016, 2017, 2018, 2019, 2020, 
                                 2021, 2014, 2015, 2016, 2017, 2018, 2019, 2020), 
                        month = c(2, 
                                  2, 2, 2, 2, 2, 2, 2, 10, 10, 10, 10, 10, 10, 10), 
                        performance = c(0.826973794097158, 
                                        0.61975709469356, 0.924350659523548, -0.183133219063708, -0.529913189565746, 
                                        -0.148531188902535, -0.0773058814083695, 1.42862504650241, 0.465498268732376, 
                                        0.148719963224136, 0.205614191281359, 0.560651497949418, -0.484408605607923, 
                                        0.875353374774486, 0.351469397380814)), 
                   row.names = c(NA, -15L), class = c("data.table", "data.frame"))

This data.table looks like the following:

    year month performance
 1: 2014     2  0.82697379
 2: 2015     2  0.61975709
 3: 2016     2  0.92435066
 4: 2017     2 -0.18313322
 5: 2018     2 -0.52991319
 6: 2019     2 -0.14853119
 7: 2020     2 -0.07730588
 8: 2021     2  1.42862505
 9: 2014    10  0.46549827
10: 2015    10  0.14871996
11: 2016    10  0.20561419
12: 2017    10  0.56065150
13: 2018    10 -0.48440861
14: 2019    10  0.87535337
15: 2020    10  0.35146940

To find the weighted mean by month, I have used the following code:

setDT(dat)[, lapply(.SD, function(x) weighted.mean(x, na.rm = TRUE)), by = .(month), .SDcols = c("performance")]

and the result I am getting is:

   month performance
1:     2   0.3576029
2:    10   0.3032712

However, the weighted mean performance of month 10 should be greater than that of month 2, as it has more positive values.

It seems that the single month-2 observation from 2021 is weighing heavily on the month-2 average, causing it to outperform month 10. In fact, the code above is only computing the plain mean and NOT a weighted mean; the result is the same if I use mean instead of weighted.mean.

setDT(dat)[, lapply(.SD, function(x) mean(x, na.rm = TRUE)), by = .(month), .SDcols = c("performance")]

and the result after using the simple mean is the following, which is the same as the result of weighted.mean:

   month performance
1:     2   0.3576029
2:    10   0.3032712
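
This matches the signature weighted.mean(x, w, ...): when w is omitted, every element of x gets the same weight and the call reduces to a plain mean. A minimal standalone illustration with toy numbers (not taken from the data above):

x <- c(1, 2, 10)
weighted.mean(x)                   # 4.333333 -- identical to mean(x)
weighted.mean(x, w = c(1, 1, 8))   # 8.3 -- explicit weights change the result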

The desired result should give equal weight to the performance of each year so that exceptional performance in one particular year does not falsely show that the product sells wonderfully during that month each year.

Can someone point out what is wrong with my weighted mean calculation?

Saurabh
  • What do you call a weighted mean? In R this means `?weighted.mean`, and as you don't give any weights, you get a normal mean. – Waldi Jun 05 '21 at 16:35
  • In the reference section of package stats, it is mentioned for ```weighted.mean``` - "If w is missing then all elements of x are given the same weight." I want to keep the weights for each year equal. – Saurabh Jun 05 '21 at 16:36
  • So you get a normal mean: what is the problem then? – Waldi Jun 05 '21 at 16:37
  • If each year has equal weight, then the performance of month ```10``` should be better than month ```2```. See all the negative values in month ```2```. I have tried to explain it a little better in the question. If we take the mean of the sign of the ```performance``` column, then month ```10``` is the clear winner. ```setDT(dat)[, lapply(.SD, function(x) mean(sign(x), na.rm = TRUE)), by = .(month), .SDcols = c("performance")] ``` – Saurabh Jun 05 '21 at 16:38
  • No, because `mean = sum/n`, and as you pointed out, row 8 compensates for all the negative values. Are you looking for the median? – Waldi Jun 05 '21 at 16:41
  • Using the ```median``` gives a better result than the ```mean``` but that does not solve the problem. I want to limit the weight of any single year's performance so that it does not dominate the end result. – Saurabh Jun 05 '21 at 16:51

3 Answers


As a new Stack Overflow user I cannot add comments on posts, so I will add my doubts here.

In general you are getting a simple mean with the code you provide, and I do not clearly understand what you want, because normally when you want a weighted mean you use a second variable as the weights.

In your case, a simple mean returns the same output:

library(dplyr)

dat %>% 
  group_by(month) %>% 
  summarise(performance = mean(performance))
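
Once you do have a weighting variable, the dplyr version would simply pass it as the second argument of weighted.mean. A minimal sketch, assuming a hypothetical weight column wt that you would have to add yourself:

dat %>% 
  group_by(month) %>% 
  summarise(performance = weighted.mean(performance, w = wt))  # wt is hypothetical, not in the posted data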
  • In the reference section of package ```stats```, it is mentioned for ```weighted.mean``` - "If w is missing then all elements of x are given the same weight." I want to keep the weights for each year equal. – Saurabh Jun 05 '21 at 16:34

If you use the weighted.mean function without specifying the weights, it will simply calculate a plain average for you. To calculate a weighted mean, pass your weights as the second parameter of weighted.mean.

library(data.table)
dat <- structure(list(year = c(2014, 2015, 2016, 2017, 2018, 2019, 2020, 
                               2021, 2014, 2015, 2016, 2017, 2018, 2019, 2020), 
                      month = c(2, 
                                2, 2, 2, 2, 2, 2, 2, 10, 10, 10, 10, 10, 10, 10), 
                      performance = c(0.826973794097158, 
                                      0.61975709469356, 0.924350659523548, -0.183133219063708, -0.529913189565746, 
                                      -0.148531188902535, -0.0773058814083695, 1.42862504650241, 0.465498268732376, 
                                      0.148719963224136, 0.205614191281359, 0.560651497949418, -0.484408605607923, 
                                      0.875353374774486, 0.351469397380814)), 
                 row.names = c(NA, -15L), class = c("data.table", "data.frame"))
head(dat)
setDT(dat)
dat[,.(weighted.mean(performance)), by = month]
dat[,.(mean(performance)), by = month]

Both calls return the same values reported in the question: 0.3576029 for month 2 and 0.3032712 for month 10.

So to resolve this issue you can do the following: add a column of weights to your dataset. I added a wt variable as my weights. Here I simply took the sequence 1 to 15 as weights; you need to put your exact values/weights in its place. Then pass this column as the second argument to weighted.mean. I think this should resolve your problem.

dat$wt <- 1:nrow(dat)
weighted.mean(dat$performance,dat$wt) # will give you full column weighted mean
dat[,.(weighted.mean(performance,wt)), by = .(month)] # will give you weighted mean by month


Anup Tirpude
  • This is exactly the problem I am facing. Both ```mean``` and ```weighted.mean``` give the same result. – Saurabh Jun 05 '21 at 16:52
  • Hi Saurabh, I have added some more information to my answer on how you can add weights. Please check and let me know if you still have doubts and whether this is what you're looking for. – Anup Tirpude Jun 05 '21 at 17:15
  • As you have used incremental weights, this will increase the impact of recent years over past years. That is not what I am looking for. I want to limit the maximum impact of a year's performance so that it does not artificially inflate the performance for any month. – Saurabh Jun 05 '21 at 17:25
  • My weights are just an example; you can use whatever suits your case. If you do not want to emphasize any specific year, using a weighted mean is pointless, as emphasizing is exactly what it is made for. I recommend you use some other measure like a trimmed mean or the median instead; that will do the job without a weighted mean. – Anup Tirpude Jun 05 '21 at 17:49
  • Thanks, Anup. Waldi has shared a kind of trimmed-mean procedure that limits the effect of extreme values. – Saurabh Jun 05 '21 at 17:55

You could simply remove outliers:

remove_outliers <- function(x, na.rm = TRUE, ...) {
  # mark values more than 1.5 * IQR below the first quartile or above the third quartile as NA
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}
setDT(dat)[, lapply(.SD, function(x) mean(remove_outliers(x), na.rm = TRUE)), by = .(month), .SDcols = c("performance")]

   month performance
1:     2   0.3576029
2:    10   0.4345511

Or limit outliers, for instance to the first and third quartiles:

limit_outliers <- function(x, na.rm = TRUE, ...) {
  # clamp values below the first quartile up to Q1 and values above the third quartile down to Q3
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
  y <- x
  y[x < qnt[1]] <- qnt[1]
  y[x > qnt[2]] <- qnt[2]
  y
}

setDT(dat)[, lapply(.SD, function(x) mean(limit_outliers(x), na.rm = TRUE)), by = .(month), .SDcols = c("performance")]

   month performance
1:     2   0.3261458
2:    10   0.3432951
Waldi
  • Removing the outliers is not giving the best results on the complete dataset. Is it possible to limit the maximum effect a single year can have on the end result? E.g. in row 8 of the dataset, ``` 8: 2021 2 1.42862505```, the performance is 1.42 and I want to count it as no more than 1 (100%). – Saurabh Jun 05 '21 at 17:14
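
A minimal sketch of the capping idea from this last comment, assuming a hard cap of 1 (the threshold comes from the comment above, not from the answer itself):

# cap each performance value at 1 (100%) before taking the plain per-month mean
setDT(dat)[, lapply(.SD, function(x) mean(pmin(x, 1), na.rm = TRUE)),
           by = .(month), .SDcols = c("performance")]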