How to drop observations with values in the top and bottom one percent by year in R

Question

I am trying to drop observations with prices in the top and bottom one percent, by year. I have been attempting to use dplyr's group_by function to group by year_sold and then mutate()to create a variable to_drop whose value is conditional on the variable price being between the 1st and 99th percentile. Here's what I have so far:

df <- df %>%  dplyr::group_by(year_sold) %>%
  mutate(to_drop = ifelse(price <= quantile(price,0.01) | price >= quantile(price,0.99),1,0))

However, I am not getting the quantiles of price grouped by year_sold. Removing dplyr::group_by(year_sold) %>% doesn't seem to change my results.

I'm trying to find alternatives to Stata's very useful bysort command. Here's how I would have done this in Stata:

gen to_drop = 0
foreach y in year_sold { 
    quietly bysort `y': summarize price, detail // To get r(p1) and r(p99)
    bysort `y': replace to_drop = 1 if ! inrange(price, r(p1), r(p99))
}

Can someone please help me either figure out why group_by isn't working as I would expect it to, or alternatively, help me figure out another way of achieving this task in R?

Hm. Your code looks fine and works fine. I checked on `mtcars`, grouped by `cyl` and computed quantiles of `mpg`. — stefan, Nov 18 '20 at 06:33
Try using `dplyr::mutate`. If it still doesn't work as expected consider providing a [reproducible example](http://stackoverflow.com/questions/5963269). — Ronak Shah, Nov 18 '20 at 06:48
Thank you @stefan. I should have thought of checking my code with another dataset. I tried it with `mtcars` and `iris` and it worked with both datasets. In fact, comparing it to my Stata code, I realized that my Stata code wasn't doing what I expected it to. So I'm very glad to have caught that, with your help. — Amila, Nov 19 '20 at 07:09
And thank you @RonakShah. Adding `dplyr::` before `mutate` fixed the problem the above code had and it now matches Neeraj's code below. — Amila, Nov 19 '20 at 07:12

score 0 · Accepted Answer · answered Nov 18 '20 at 08:17

0

You can use base split and lapply function to get desired results.

library(magrittr)

 #generating data
 df <- data.frame(year = rep(c(2001,2002), each = 20), price = runif(40, 40, 100))

filterf <- function(df) {
  q <- quantile(df$price, c(.01, .99))
  df[ df$price > q[1] & df$price < q[2], ]
}

split(df, df$year) %>% lapply(., FUN = filterf) %>% Reduce(rbind, .)

answered Nov 18 '20 at 08:17

Neeraj

1,166
9
21

Thank you! This code does exactly what I expect it to. And it matches the results the above code gives me (after changing `mutate` to `dplyr::mutate` above). – Amila Nov 19 '20 at 07:14

How to drop observations with values in the top and bottom one percent by year in R

1 Answers1