I am trying to remove some outliers from my data set, investigating one variable at a time. I have constructed boxplots for the variables, but I don't want to remove every point classified as an outlier, only the most extreme ones. So for each variable I note the value on the boxplot that I don't want it to exceed, and then try to remove the rows whose value in that column exceeds the chosen cutoff.

For example, my data set is called milk and one of the variables is called alpha_s1_casein. I thought the following would remove all rows in the data set where the value for alpha_s1_casein is greater than 29:

milk <- milk[milk$alpha_s1_casein < 29,]

In fact it did. The number of rows in the data frame decreased from 430 to 428. However, it has introduced a lot of NA values in unrelated columns of my data set.

Before I ran the above code, the number of NAs was:

sum(is.na(milk))

5909 NA values.

But after running the above, the number of NAs returned is:

sum(is.na(milk))

75912 NA values.

I don't understand what is going wrong here, or why this introduces more NA values than I started with, when all I'm trying to do is remove observations whose value in one column exceeds a certain number.

Can anyone help? I'm desperate.

Phil
daisy
  • I'd recommend using `dplyr` tools, as they are more explicit about what you are doing: `dplyr::filter(milk, alpha_s1_casein < 29)`. – Phil Feb 26 '21 at 15:12

1 Answer

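The extra NAs most likely appear because alpha_s1_casein itself contains missing values: comparing NA with a number yields NA, and indexing a data frame with an NA row index returns a row filled entirely with NA. `which()` returns only the positions where the condition is TRUE, so those NAs never reach the index. Without using additional packages, to remove all rows in the data set where the value for alpha_s1_casein is greater than 29, you can do this: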

milk <- milk[-which(milk$alpha_s1_casein > 29),]
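To see the effect on a small scale, here is a minimal sketch. The data frame below is a made-up stand-in for milk (the values and the `fat` column are assumptions, not the asker's data). It reproduces the extra NAs and shows two ways of avoiding them, depending on whether rows with a missing alpha_s1_casein should be dropped or kept.

```r
# Toy stand-in for the milk data; the values and the `fat` column are invented.
milk <- data.frame(
  alpha_s1_casein = c(10, 35, NA, 20, 31),
  fat             = c(3.1, 3.5, 3.8, 4.0, 3.3)
)

# Logical indexing: the comparison is NA for the third row, and an NA row
# index returns a row that is NA in every column, including `fat`.
bad <- milk[milk$alpha_s1_casein < 29, ]
sum(is.na(milk))  # NAs before subsetting
sum(is.na(bad))   # more NAs than before

# Option 1: keep only rows known to be below the cutoff.
# which() drops the NA positions, so rows with a missing value are removed.
drop_na <- milk[which(milk$alpha_s1_casein < 29), ]

# Option 2: remove only rows known to exceed the cutoff and keep rows
# whose alpha_s1_casein is missing, with their other columns intact.
keep_na <- milk[is.na(milk$alpha_s1_casein) | milk$alpha_s1_casein <= 29, ]
```

Option 2 gives the same rows as the `-which(...)` form whenever at least one value exceeds the cutoff. One caveat with `-which(...)`: if no row matches, `which()` returns an empty vector and the negative index then selects zero rows instead of all of them, so the explicit logical condition may be the safer choice when looping over many variables.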
bricx