How to remove rows from a data set when values in a particular column exceed a given number in RStudio

Question

I am trying to remove some outliers from my data set. I am investigating each variable in the data one at a time. I have constructed boxplots for variables but don't want to remove all the classified outliers, only the most extreme. So I am noting the value on the boxplot that I don't want my variable to exceed and trying to remove rows that correspond to the observations that have a specific column value that exceed the chosen value.

For example, my data set is called milk and one of the variables is called alpha_s1_casein. I thought the following would remove all rows in the data set where the value for alpha_s1_casein is greater than 29:

milk <- milk[milk$alpha_s1_casein < 29,]

In fact it did. The number of rows in the data frame decreased from 430 to 428. However, it has introduced a lot of NA values in unrelated columns of my data set.

Before I ran the above code, the number of NAs was:

sum(is.na(milk))

5909 NA values.

But after running the line above, the number of NAs returned is:

sum(is.na(milk))

75912 NA values.

I don't understand what is going wrong here and why what I'm doing is introducing more NA values than when I started when all I'm trying to do is remove observations if a column value exceeds a certain number.

Can anyone help? I'm desperate

I'd recommend using `dplyr` tools as they are more explicit in what you are doing: `dplyr::filter(milk, alpha_s1_casein < 29)`. (Comment by Phil, Feb 26 '21 at 15:12)

score 0 · Answer 1 · answered Feb 26 '21 at 16:45

Without using additional packages, to remove all rows in the data set where the value for alpha_s1_casein is greater than 29, you can just do this:

milk <- milk[-which(milk$alpha_s1_casein > 29),]
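A likely cause of the extra NAs, sketched on a toy data frame (the values and the `fat` column are made up for illustration and are not the asker's data): if `alpha_s1_casein` itself contains NA, the comparison returns NA for those rows, and indexing a data frame with an NA logical gives back a row in which every column is NA. `which()` returns only the positions where the condition is TRUE, so it sidesteps this. One caveat with the negated form: if no value exceeds the threshold, `-which(...)` is empty and the result has zero rows.

```r
# Toy stand-in for the milk data (hypothetical values)
toy <- data.frame(
  alpha_s1_casein = c(10, 35, NA, 20),
  fat             = c(3.1, 3.5, 4.0, 3.8)
)

# Logical indexing: the NA in the condition becomes a row of all NAs,
# so the valid fat value 4.0 is lost as well
bad <- toy[toy$alpha_s1_casein < 29, ]

# Option 1: keep rows whose alpha_s1_casein is missing, drop only the extremes
keep <- is.na(toy$alpha_s1_casein) | toy$alpha_s1_casein < 29
kept <- toy[keep, ]

# Option 2: drop rows with a missing alpha_s1_casein along with the extremes
dropped <- toy[which(toy$alpha_s1_casein < 29), ]

# Pitfall of -which(): nothing exceeds 100, so the index is empty
# and every row is removed instead of none
empty <- toy[-which(toy$alpha_s1_casein > 100), ]

print(bad)
print(kept)
print(dropped)
print(nrow(empty))
```

Note also that `< 29` removes rows equal to 29, whereas `-which(... > 29)` keeps them; use `<= 29` if a value of exactly 29 should stay.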

answered Feb 26 '21 at 16:45

bricx
