I'd like to write a function that removes all outliers from my data set. I've read a lot of Stack Overflow posts about this, so I'm aware of the dangers of removing outliers, but none of the functions I've seen so far fit my type of data. Here's what I have so far:

My minimal data set example:

ID, Treatment, conc, relabs
1, A, 40.00, 1.0793923
2, A, 40.00, 0.6436631
3, A, 40.00, 0.5556844
4, A, 40.00, 0.4834845
5, A, 40.00, 0.7224756
6, A, 40.00, 0.6804259
7, A, 20.00, 0.9958288
8, A, 20.00, 0.7099360
9, A, 20.00, 0.7028124
10, A, 20.00, 0.5016352
11, A, 20.00, 0.6860346
12, A, 20.00, 0.7341970
13, A, 10.00, 0.8175491
14, A, 10.00, 0.6900910
15, A, 10.00, 0.5278228
16, A, 10.00, 0.7560026
17, A, 10.00, 0.8841343
18, A, 10.00, 0.6687616
19, A, 5.00, 0.8563232
20, A,  5.00, 0.7419997
21, B, 0.80, 1.2049695
22, B, 0.80, 0.4969811
23, B, 0.80, 0.2835814
24, B, 0.80, 0.6700250
25, B, 0.80, 1.3126651
26, B, 0.80, 0.4510617
27, B, 0.60, 0.7629639
28, B, 0.60, 0.7513716
19, B, 0.60, 0.7956074

I use the identify_outliers() function from the rstatix package to identify outliers within each combination of Treatment and conc. It returns a data frame with two new columns, is.outlier and is.extreme.

df_outliers <-
df %>% 
  group_by(Treatment, conc) %>% 
  identify_outliers("relabs") 

df_outliers

Then I manually remove the outliers by pasting their IDs from the df_outliers data frame into dplyr's slice() function, which would be troublesome with a bigger data set:

df_wo_outliers <- 
  df %>% 
  slice(-c(1, 7, 10, 19 )) %>% 
  select(-ID)

df_wo_outliers

I somehow need to automatically remove from my original data set the rows where is.outlier = TRUE for the relabs column.

That would mean that, within that concentration (variable conc) and treatment (variable Treatment), the relative absorption (variable relabs) was too high or too low (above Q3 + 1.5 × IQR or below Q1 - 1.5 × IQR).
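For illustration, the rule above could be computed per group roughly as follows. This is a minimal sketch, assuming R's default quantile definition and using only the first twelve rows of the example data; the names fences, q1, q3, iqr, lower and upper are made up for this example.

```r
library(dplyr)

# Stand-in for the example data: Treatment A at conc 40 and 20 only
df <- tibble(
  ID        = 1:12,
  Treatment = "A",
  conc      = rep(c(40, 20), each = 6),
  relabs    = c(1.0793923, 0.6436631, 0.5556844, 0.4834845, 0.7224756, 0.6804259,
                0.9958288, 0.7099360, 0.7028124, 0.5016352, 0.6860346, 0.7341970)
)

# One row per Treatment/conc group with the lower and upper fences
fences <- df %>%
  group_by(Treatment, conc) %>%
  summarise(
    q1    = quantile(relabs, 0.25, names = FALSE),
    q3    = quantile(relabs, 0.75, names = FALSE),
    iqr   = q3 - q1,
    lower = q1 - 1.5 * iqr,   # values below this count as outliers
    upper = q3 + 1.5 * iqr,   # values above this count as outliers
    .groups = "drop"
  )

fences
```

Any relabs value outside its own group's lower/upper range is an outlier under this rule; values from other groups do not affect the fences.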

I'm open to suggestions for an existing function or to writing my own. However, I'm unsure how to filter the data so that outliers are removed within each group (that is, by Treatment and conc) rather than across the whole data set, which is the case most posts I've seen discuss.

Also, is there a way to calculate confidence intervals per group in a similar way? Since I haven't yet managed to filter my data set correctly, I expect to run into a similar issue there.
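One possible way to get per-group confidence intervals is the same group_by() pattern followed by summarise(). The sketch below assumes a t-based 95% interval for the group mean, which may or may not be the interval wanted here; the column names n_obs, mean_relabs, se, ci_lower and ci_upper are hypothetical, and the data are again only the first twelve rows of the example.

```r
library(dplyr)

# Stand-in for the example data: Treatment A at conc 40 and 20 only
df <- tibble(
  ID        = 1:12,
  Treatment = "A",
  conc      = rep(c(40, 20), each = 6),
  relabs    = c(1.0793923, 0.6436631, 0.5556844, 0.4834845, 0.7224756, 0.6804259,
                0.9958288, 0.7099360, 0.7028124, 0.5016352, 0.6860346, 0.7341970)
)

# Assumed: 95% t-interval for the mean of relabs within each group
ci_by_group <- df %>%
  group_by(Treatment, conc) %>%
  summarise(
    n_obs       = n(),
    mean_relabs = mean(relabs),
    se          = sd(relabs) / sqrt(n_obs),            # standard error of the mean
    t_crit      = qt(0.975, df = n_obs - 1),           # two-sided 95% critical value
    ci_lower    = mean_relabs - t_crit * se,
    ci_upper    = mean_relabs + t_crit * se,
    .groups     = "drop"
  )

ci_by_group
```

To compute the intervals on the cleaned data, the same summarise() call can be applied to the data frame with outliers already removed.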

I'm also including a picture of a part of my data if needed: section of my data set

I'm working on Windows 10, RStudio version 1.3.1073

Simona

2 Answers

You could use anti_join() from dplyr after getting the outliers. Note that in my df_outliers I only get IDs 1, 7 and 10.

library(tidyverse)
library(rstatix)

df <- tibble(
                ID = c(1L,2L,3L,4L,5L,6L,7L,8L,
                       9L,10L,11L,12L,13L,14L,15L,16L,17L,18L,19L,
                       20L,21L,22L,23L,24L,25L,26L,27L,28L,19L),
         Treatment = c("A","A","A","A","A","A",
                       "A","A","A","A","A","A","A","A","A","A","A","A",
                       "A","A","B","B","B","B","B","B","B","B","B"),
              conc = c(40,40,40,40,40,40,20,20,
                       20,20,20,20,10,10,10,10,10,10,5,5,0.8,0.8,
                       0.8,0.8,0.8,0.8,0.6,0.6,0.6),
            relabs = c(1.0793923,0.6436631,0.5556844,
                       0.4834845,0.7224756,0.6804259,0.9958288,0.709936,
                       0.7028124,0.5016352,0.6860346,0.734197,0.8175491,
                       0.690091,0.5278228,0.7560026,0.8841343,0.6687616,
                       0.8563232,0.7419997,1.2049695,0.4969811,0.2835814,0.670025,
                       1.3126651,0.4510617,0.7629639,0.7513716,0.7956074)
)

df_outliers <- df %>% 
  group_by(Treatment, conc) %>% 
  identify_outliers("relabs") 

# A tibble: 3 x 6
  Treatment  conc    ID relabs is.outlier is.extreme
  <chr>     <dbl> <int>  <dbl> <lgl>      <lgl>     
1 A            20     7  0.996 TRUE       TRUE      
2 A            20    10  0.502 TRUE       TRUE      
3 A            40     1  1.08  TRUE       FALSE  

# without outliers
df %>% 
  anti_join(df_outliers, by = "ID") %>% 
  view()

# A tibble: 26 x 4
      ID Treatment  conc relabs
   <int> <chr>     <dbl>  <dbl>
 1     2 A            40  0.644
 2     3 A            40  0.556
 3     4 A            40  0.483
 4     5 A            40  0.722
 5     6 A            40  0.680
 6     8 A            20  0.710
 7     9 A            20  0.703
 8    11 A            20  0.686
 9    12 A            20  0.734
10    13 A            10  0.818
# … with 16 more rows
william3031

You can use dplyr::filter() for this. Since you want to keep the rows where is.outlier == FALSE, you need to use the exclamation point, which is the negation operator.

library(dplyr)
library(rstatix)  # provides identify_outliers()
df_no_outliers <- df %>%
  group_by(Treatment, conc) %>%
  identify_outliers("relabs") %>%
  filter(!is.outlier)
Ben Norris
  • Thank you for this. I've tried running it, but it gives me a result of 0 rows. I believe it is because my original data set does not have the column is.outlier, so it cannot filter even for the ones which would be TRUE – Simona Feb 28 '21 at 16:56
  • @SimonaZubavičiūtė - If you could post a sample of your data, and tell us which package contains `identify_outliers()`, I could test this code and verify. – Ben Norris Feb 28 '21 at 23:15
  • I've attached a picture of a section of my data now, identify_outliers is from a package rstatix. – Simona Mar 01 '21 at 11:52
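
Regarding the 0 rows reported in the comments: identify_outliers() returns only the rows it flags, so every row in its output has is.outlier == TRUE and filter(!is.outlier) leaves nothing. A minimal sketch of one way around this, assuming rstatix's vector helper is_outlier() (default coefficient 1.5) and using only the first twelve rows of the example data:

```r
library(dplyr)
library(rstatix)

# Stand-in for the example data: Treatment A at conc 40 and 20 only
df <- tibble(
  ID        = 1:12,
  Treatment = "A",
  conc      = rep(c(40, 20), each = 6),
  relabs    = c(1.0793923, 0.6436631, 0.5556844, 0.4834845, 0.7224756, 0.6804259,
                0.9958288, 0.7099360, 0.7028124, 0.5016352, 0.6860346, 0.7341970)
)

# is_outlier() is evaluated separately within each Treatment/conc group,
# so each value is compared against the fences of its own group only
df_no_outliers <- df %>%
  group_by(Treatment, conc) %>%
  filter(!is_outlier(relabs)) %>%
  ungroup()

df_no_outliers
```

Because the filtering happens on the original rows, this does not depend on the ID column being unique, which matters here since ID 19 appears twice in the example data.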