0

I have a panel dataset that goes like this

year id treatment_year time_to_treatment outcome
2000 1 2011 -11 2
2002 1 2011 -10 3
2004 2 2015 -9 22

and so on and so forth. I am trying to deal with the outliers by 'Winsorize'. The end goal is to make a scatterplot with time_to_treatment on the X axis and outcome on the Y.

I would like to replace the outcomes for each time_to_treatment by its winsorized outcomes, i.e. replace all extreme values with the 5% and 95% quantile values. So far what I have tried to do is this but it doesn't work.

for(i in range(dataset$time_to_treatment)){
    dplyr::filter(dataset, time_to_treatment == i)$outcome <-  DescTools::Winsorize(dplyr::filter(dataset,time_to_treatment==i)$outcome)
}

I get the error - Error in filter(dataset, time_to_treatment == i) <- *vtmp* : could not find function "filter<-"

Would anyone able to give a better way? Thanks.


my actual data where: conflicts = outcome, commission = year of treatment, CD_mun = id.

The concerned time period indicator is time_to_t

Groups: year, CD_MUN, type [6]

type CD_MUN year time_to_t conflicts commission
chr dbl dbl dbl int dbl
manif 1100023 2000 -11 1 2011
manif 1100189 2000 -3 2 2003
manif 1100205 2000 -9 5 2009
manif 1500602 2000 -4 1 2004
manif 3111002 2000 -11 2 2011
manif 3147006 2000 -10 1 2010
suyash_n
  • 3
  • 2
  • Question is, do you need the `for` loop at all? Have you tried without it? – Chris Ruehlemann Dec 22 '21 at 11:32
  • What about `dplyr::mutate(dat, outcome_w=DescTools::Winsorize(outcome))`? – jay.sf Dec 22 '21 at 11:43
  • Thanks I tried making a function but dont see how I could add it back to my original dataset. @ChrisRuehlemann – suyash_n Dec 22 '21 at 12:42
  • thanks but I dont want to winsorize the entire outcome column, since values change over time. I want to do it for each individual time periods. @jay.sf – suyash_n Dec 22 '21 at 12:43
  • 1
    @suyash_n Maybe you could provide more suitable data with those time periods you mention. Please read [how-to-make-a-great-r-reproducible-example](https://stackoverflow.com/a/5963610/6574038). – jay.sf Dec 22 '21 at 12:47
  • @jay.sf just updated the question with my actual data – suyash_n Dec 22 '21 at 13:04

2 Answers2

1

Assuming, "time periods" refer to 'commission' column, you may use ave.

transform(dat, conflicts_w=ave(conflicts, commission, FUN=DescTools::Winsorize))
#    type  CD_MUN year time_to_t conflicts commission conflicts_w
# 1 manif 1100023 2000       -11         1       2011        1.05
# 2 manif 1100189 2000        -3         2       2003        2.00
# 3 manif 1100205 2000        -9         5       2009        5.00
# 4 manif 1500602 2000        -4         1       2004        1.00
# 5 manif 3111002 2000       -11         2       2011        1.95
# 6 manif 3147006 2000       -10         1       2010        1.00

Data:

dat <- structure(list(type = c("manif", "manif", "manif", "manif", "manif", 
"manif"), CD_MUN = c(1100023L, 1100189L, 1100205L, 1500602L, 
3111002L, 3147006L), year = c(2000L, 2000L, 2000L, 2000L, 2000L, 
2000L), time_to_t = c(-11L, -3L, -9L, -4L, -11L, -10L), conflicts = c(1L, 
2L, 5L, 1L, 2L, 1L), commission = c(2011L, 2003L, 2009L, 2004L, 
2011L, 2010L)), class = "data.frame", row.names = c(NA, -6L))
jay.sf
  • 60,139
  • 8
  • 53
  • 110
0

For a start you may use this:

# The data
set.seed(123)
df <- data.frame(
  time_to_treatment = seq(-15, 0, 1),
  outcome = sample(1:30, 16, replace=T)
)

# A solution without Winsorize based solely on dplyr
library(dplyr)
df %>% 
  mutate(outcome05 = quantile(outcome, probs = 0.05), # 5% quantile
         outcome95 = quantile(outcome, probs = 0.95), # 95% quantile
         outcome = ifelse(outcome <= outcome05, outcome05, outcome), # replace
         outcome = ifelse(outcome >= outcome95, outcome95, outcome)) %>% 
  select(-c(outcome05, outcome95))

You may adapt this to your exact problem.

timm
  • 257
  • 1
  • 10
  • Thanks a lot! I tried adapting the code as follows. here conflicts = outcome in my data `for(i in range(noliers$time_to_t)){ filter(noliers, time_to_t == i)%>% mutate(outcome05 = quantile(conflicts, probs = 0.05), # 5% quantile outcome95 = quantile(conflicts, probs = 0.95), # 95% quantile outcome = ifelse(conflicts <= outcome05, outcome05, conflicts), # replace outcome = ifelse(conflicts >= outcome95, outcome95, conflicts)) %>% select(-c(outcome05, outcome95)) }`. however, nothing changes, even if i try with very narrow quantile ranges. :-/ – suyash_n Dec 22 '21 at 13:17
  • Alright. Have you tried ```noliers %>% group_by(time_to_t) %>% ...``` instead of using a loop with a filter? – timm Dec 22 '21 at 13:49
  • Well that didn't work either, totally at a loss – suyash_n Dec 22 '21 at 14:05
  • Okay. How many observations do you have per ```time_to_t```? More than one or two, I guess? – timm Dec 22 '21 at 14:15
  • ... and if you want to use ```filter```. You should use ```dplyr::filter```, as there might be conflicts with other packages. – timm Dec 22 '21 at 14:20