I'd like to write a function that removes all outliers from my data set. I've read a lot of Stack Overflow posts about this, so I'm aware of the dangers of removing outliers, but none of the functions I've seen so far fit my type of data. Here's what I have so far:
My minimal data set example:
ID, Treatment, conc, relabs
1, A, 40.00, 1.0793923
2, A, 40.00, 0.6436631
3, A, 40.00, 0.5556844
4, A, 40.00, 0.4834845
5, A, 40.00, 0.7224756
6, A, 40.00, 0.6804259
7, A, 20.00, 0.9958288
8, A, 20.00, 0.7099360
9, A, 20.00, 0.7028124
10, A, 20.00, 0.5016352
11, A, 20.00, 0.6860346
12, A, 20.00, 0.7341970
13, A, 10.00, 0.8175491
14, A, 10.00, 0.6900910
15, A, 10.00, 0.5278228
16, A, 10.00, 0.7560026
17, A, 10.00, 0.8841343
18, A, 10.00, 0.6687616
19, A, 5.00, 0.8563232
20, A, 5.00, 0.7419997
21, B, 0.80, 1.2049695
22, B, 0.80, 0.4969811
23, B, 0.80, 0.2835814
24, B, 0.80, 0.6700250
25, B, 0.80, 1.3126651
26, B, 0.80, 0.4510617
27, B, 0.60, 0.7629639
28, B, 0.60, 0.7513716
29, B, 0.60, 0.7956074
I use the identify_outliers function from the rstatix package to identify outliers within each combination of Treatment and conc. It gives me a data frame with two new columns, is.outlier and is.extreme.
library(dplyr)
library(rstatix)

# Flag outliers in relabs within each Treatment/conc group
df_outliers <-
  df %>%
  group_by(Treatment, conc) %>%
  identify_outliers("relabs")
df_outliers
Then I manually remove the outliers by pasting the IDs from the df_outliers data frame into the slice function from the dplyr package, which would be troublesome with a bigger data set:
df_wo_outliers <-
  df %>%
  slice(-c(1, 7, 10, 19)) %>%
  select(-ID)
df_wo_outliers
I somehow need to automatically remove from my original data set the rows where is.outlier is TRUE for the relabs column.
That would mean that, within that concentration (variable conc) and treatment (variable Treatment), the relative absorption (variable relabs) was too high or too low (above Q3 + 1.5 × IQR or below Q1 - 1.5 × IQR).
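To make the rule concrete, here is a minimal base R sketch of the 1.5 × IQR fences applied to one group. The helper name is_iqr_outlier is hypothetical (it is not part of rstatix), and it assumes R's default quantile method, so its results may differ slightly from tools that compute quartiles another way.

```r
# Hypothetical helper (not from rstatix): TRUE where a value lies
# outside the fences Q1 - coef * IQR and Q3 + coef * IQR.
is_iqr_outlier <- function(x, coef = 1.5) {
  # Quartiles using R's default quantile method (type 7)
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  x < q[1] - coef * iqr | x > q[2] + coef * iqr
}

# relabs values for Treatment A, conc 40.00 (IDs 1 to 6 in the example data)
relabs_a40 <- c(1.0793923, 0.6436631, 0.5556844, 0.4834845, 0.7224756, 0.6804259)

is_iqr_outlier(relabs_a40)
```

For this group only the first value (ID 1) falls outside the fences, which matches the first ID I removed by hand.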
I'm open to any suggestions, whether that means an existing function or writing my own. However, I'm unsure how to filter the data so that outliers are removed within each group (by Treatment and conc) rather than across the whole data set, which is what most of the answers I've seen discuss.
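One possible direction, sketched under assumptions: dplyr's filter evaluates its conditions separately within each group of a grouped data frame, so the fences can be computed per Treatment/conc combination without pasting IDs. The snippet below uses only the first 12 rows of the example data to stay short, and it assumes the default quantile method in quantile() and IQR().

```r
library(dplyr)

# First 12 rows of the example data (Treatment A, conc 40 and 20)
df <- data.frame(
  ID = 1:12,
  Treatment = "A",
  conc = rep(c(40, 20), each = 6),
  relabs = c(1.0793923, 0.6436631, 0.5556844, 0.4834845, 0.7224756, 0.6804259,
             0.9958288, 0.7099360, 0.7028124, 0.5016352, 0.6860346, 0.7341970)
)

df_wo_outliers <- df %>%
  group_by(Treatment, conc) %>%
  # Keep rows inside the fences; quantile() and IQR() see one group at a time
  filter(
    relabs >= quantile(relabs, 0.25) - 1.5 * IQR(relabs),
    relabs <= quantile(relabs, 0.75) + 1.5 * IQR(relabs)
  ) %>%
  ungroup()

df_wo_outliers
```

On these 12 rows this drops IDs 1, 7 and 10 and keeps the other nine. Filtering on the values directly also avoids relying on ID being unique, which joining back to df_outliers by ID would require.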
Also, is there a way to calculate confidence intervals in a similar, per-group way? Since I haven't yet filtered my data set the right way, I expect to run into a similar issue.
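As a rough sketch of what per-group confidence intervals could look like, assuming a t-based 95% interval for the group mean is what is wanted (that choice is an assumption, not something stated above), group_by followed by summarise gives one row per Treatment/conc combination. The data are again the first 12 rows of the example, and the column names n, mean_relabs, se, ci_lower and ci_upper are made up for illustration.

```r
library(dplyr)

# First 12 rows of the example data (Treatment A, conc 40 and 20)
df <- data.frame(
  ID = 1:12,
  Treatment = "A",
  conc = rep(c(40, 20), each = 6),
  relabs = c(1.0793923, 0.6436631, 0.5556844, 0.4834845, 0.7224756, 0.6804259,
             0.9958288, 0.7099360, 0.7028124, 0.5016352, 0.6860346, 0.7341970)
)

ci_by_group <- df %>%
  group_by(Treatment, conc) %>%
  summarise(
    n = n(),                                   # observations per group
    mean_relabs = mean(relabs),
    se = sd(relabs) / sqrt(n),                 # standard error of the mean
    # t-based 95% interval (assumed choice of interval type and level)
    ci_lower = mean_relabs - qt(0.975, df = n - 1) * se,
    ci_upper = mean_relabs + qt(0.975, df = n - 1) * se,
    .groups = "drop"
  )

ci_by_group
```

The same pipeline could be run on the outlier-filtered data frame instead of df. Groups with very few observations will give very wide intervals, and a group with a single row yields NA for se.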
I'm also including a picture of part of my data, if needed: (image: section of my data set)
I'm working on Windows 10 with RStudio version 1.3.1073.