I am filtering data for analysis and have run into a problem I cannot find a solution for. I looked into the prepdat package, but it does not seem to meet my needs. My data frame (df) consists of reaction times of several participants measured over 4 blocks. To filter out outliers, I need to apply a (mean +/- 2.5 SD) rule to every block of each participant.

I tried writing my own function to apply this rule to every subsection of my data frame (each block of every participant separately). I wrote the function below so I can call it from a for loop (the loop might not be optimal in R, but that is not the main concern here):

filter <- function(subject, block){
  m    <- mean(df[df$subj == subject & df$block == block, 3])
  stdv <- sd(df[df$subj == subject & df$block == block, 3])
  lowerbound <- m - 2.5 * stdv
  upperbound <- m + 2.5 * stdv
  # Here I retrieve the index for all the rows I need to eliminate
  outliers <- which(df[(df$subj == subject & df$block == block), 3] <= lowerbound |
                    df[(df$subj == subject & df$block == block), 3] >= upperbound)
  df <<- df[-c(outliers), ]
}

I can't get my head around this indexing. For the first block of the first subject there seems to be no problem, and the function deletes the right rows. For the following blocks (and subjects), 'outliers' also contains the right indexes within the subset (subject and block) I select in the function. But when I try to eliminate rows with it, the indexes appear to be applied to my whole data frame rather than to the specific subject and block subset. Is there something I am missing, or something I am not (yet) aware of? Or is my overall approach wrong? (I am still adapting to R.) Here is a sample of the data:

   subj block  rt
1     1     2 345
2     1     2 118
3     1     2 302
4     1     2 698
5     1     2 154
6     2     3 347
7     2     3 391
8     2     3 414
9     2     3 427
10    2     3 369
11    6     1 685
12    6     1 369
13    6     1 457
14    6     1 566
15    6     1 542
E.Crist
  • Could you provide (some of) your data in order to make a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) that others can run? – Mikko Marttila Jul 17 '18 at 13:28
  • Your `upperbound <- m - 2.5 * stdv` but should be `upperbound <- m + 2.5 * stdv` – akash87 Jul 17 '18 at 14:17
  • Yep, I changed my mistake. Thanks for pointing it out. – E.Crist Jul 17 '18 at 14:20

1 Answer

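dplyr might be a better fit here. Because the `filter` function you defined in your workspace masks `dplyr::filter`, the last step uses the namespaced call:

library(dplyr)

df %>%
  group_by(subj, block) %>%
  # per-group lower and upper bounds
  dplyr::summarise(lb = mean(rt) - 2.5 * sd(rt),
                   ub = mean(rt) + 2.5 * sd(rt)) %>%
  # attach the bounds back to every row of the original data
  inner_join(df, by = c("subj", "block")) %>%
  ungroup() %>%
  # keep only rows strictly inside the bounds
  dplyr::filter(rt > lb & rt < ub)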

Now this results in a tibble of the same size because there are no outliers by your definition. If I change your definition to 1.5 as opposed to 2.5, then we get 20 rows. It is a matter of your definition of outliers.
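
As for why your own function misbehaves: `which()` is applied to the subset, so it returns positions inside that subset (1 to 5 for a block of five rows), and `df[-c(outliers), ]` then removes the rows at those positions in the whole data frame. For the first subject and block the two numberings happen to coincide, which is why it looked correct there. Below is a minimal base R sketch of one way around this, assuming the column names `subj`, `block` and `rt` from your sample. The helper name `remove_outliers` and its parameter `k` are hypothetical, chosen only for illustration.

```r
# Sample data copied from the question (15 rows, three subject/block groups).
df <- data.frame(
  subj  = rep(c(1, 2, 6), each = 5),
  block = rep(c(2, 3, 1), each = 5),
  rt    = c(345, 118, 302, 698, 154,
            347, 391, 414, 427, 369,
            685, 369, 457, 566, 542)
)

# which() on a subset counts positions inside the subset,
# not row numbers of the full data frame.
in_group <- df$subj == 2 & df$block == 3
which(df[in_group, "rt"] >= 427)   # 4: position within the subset
which(in_group & df$rt >= 427)     # 9: row number in df

# Hypothetical helper: ave() returns per-group statistics as a vector
# aligned with the rows of `data`, so no index translation is needed.
remove_outliers <- function(data, k = 2.5) {
  m <- ave(data$rt, data$subj, data$block, FUN = mean)
  s <- ave(data$rt, data$subj, data$block, FUN = sd)
  keep <- data$rt > m - k * s & data$rt < m + k * s
  data[keep, , drop = FALSE]
}

cleaned <- remove_outliers(df)
nrow(cleaned)
```

Building a logical `keep` vector over the full data frame avoids both the `<<-` assignment and the loop over subjects and blocks, since every row is compared against the bounds of its own group in one pass.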

akash87