Removing outliers after calculating Cook's distance for multiple regressions

Question

I am trying to remove outliers from my main data frame after calculating Cook's distance for multiple linear regressions. However, after getting the influential values, I don't know how to remove them from my data frame.

Here is a reproducible example of what I would like to achieve:

data(varechem, package = "vegan")
data(varespec, package = "vegan")
vare <- cbind(varespec, varechem)

# The result lists must exist before they can be filled inside the loop
mod_cook <- list()
cook_distance <- list()
influential <- list()
names_influencial <- list()
sample_size <- nrow(vare)

for (j in c("Callvulg", "Empenigr")) {
  for (soil in c("N", "K", "S", "P", "Fe")) {
    key <- paste(soil, j, sep = "_")  # e.g. "N_Callvulg"
    mod_cook[[key]] <- lm(vare[, j] ~ vare[, soil], na.action = na.exclude)
    cook_distance[[key]] <- cooks.distance(mod_cook[[key]])
    # Row positions whose Cook's distance exceeds the 4/n cutoff
    influential[[key]] <- which(cook_distance[[key]] > 4 / sample_size)
    names_influencial[[key]] <- names(influential[[key]])
  }
}

Now I have the influential values, which I assume are row numbers, and I want to delete these rows from my vare data frame. Is there a way to achieve this last step? Thanks a lot!

It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input that can be used to test and verify possible solutions. — MrFlick, Nov 30 '22 at 14:48
The solution is not to delete outliers because they do not fit your model! Change your model instead of changing your data. — user2974951, Nov 30 '22 at 15:04
I understand your point, but still I would like to know how to remove these rows from the main data frame, because I was not able to find a coding solution when using the for loop with multiple regressions — i.b, Nov 30 '22 at 15:21

Santiago Capobianco · Answer 1 · 2022-11-30T15:55:30.687


Assuming that influential is a vector of row indices for the observations you want to remove, and vare is a data.frame, you can do:

vare <- vare[-influential,]
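
One caveat with negative indexing, shown here as a small sketch on a made-up data frame (df and its values are hypothetical, standing in for vare): if no observation exceeds the cutoff, influential is an empty vector, and df[-influential, ] then returns zero rows instead of leaving the data unchanged. Filtering with %in% avoids that.

```r
# Toy data frame standing in for `vare` (hypothetical values)
df <- data.frame(x = 1:5, y = c(2, 4, 6, 8, 100))
influential <- integer(0)  # case where nothing exceeded the cutoff

# Negative indexing with an empty vector selects no rows at all
nrow(df[-influential, ])

# Safer: keep every row whose position is not listed in `influential`
keep <- !(seq_len(nrow(df)) %in% influential)
df_clean <- df[keep, , drop = FALSE]
nrow(df_clean)
```

With a non-empty influential vector both forms drop the same rows, so the %in% version can be used in either case.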

Edit:

If influential is a list of vectors of indices:

influential <- unique(unlist(influential))
vare <- vare[-influential,]
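
Putting the pieces together, the whole workflow from the question could look roughly like the sketch below. drop_influential is a hypothetical helper name (not from any package), the 4/n cutoff is the one used in the question, and the use of reformulate() with data = df instead of vare[, j] ~ vare[, soil] is my own choice, not something the question requires.

```r
# Sketch: fit one simple regression per response/predictor pair, flag rows whose
# Cook's distance exceeds 4/n, and drop every flagged row from the data frame.
# `drop_influential` is a hypothetical helper, not part of base R or vegan.
drop_influential <- function(df, responses, predictors) {
  cutoff <- 4 / nrow(df)
  flagged <- list()
  for (resp in responses) {
    for (pred in predictors) {
      fit <- lm(reformulate(pred, response = resp), data = df,
                na.action = na.exclude)
      # With na.exclude, rows dropped for missing values come back as NA,
      # so positions in `cd` line up with the rows of `df`
      cd <- cooks.distance(fit)
      flagged[[paste(pred, resp, sep = "_")]] <- unname(which(cd > cutoff))
    }
  }
  rows <- sort(unique(unlist(flagged)))
  list(flagged = flagged,   # influential row positions per model
       removed = rows,      # union of flagged positions over all models
       data = df[!(seq_len(nrow(df)) %in% rows), , drop = FALSE])
}

# Usage with the data from the question (only runs if vegan is installed)
if (requireNamespace("vegan", quietly = TRUE)) {
  data(varechem, package = "vegan")
  data(varespec, package = "vegan")
  vare <- cbind(varespec, varechem)
  res <- drop_influential(vare, c("Callvulg", "Empenigr"),
                          c("N", "K", "S", "P", "Fe"))
  vare_clean <- res$data
}
```

Note that this removes a row if it is influential in any of the models, which is what unique(unlist(influential)) does as well; res$flagged keeps the per-model indices in case that is too aggressive.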

Edit 2: replaced ! with -, oops

Hope it helps

edited Nov 30 '22 at 15:55

answered Nov 30 '22 at 15:22

Santiago Capobianco


Thanks! In my code influential is a list of observations for each regression – i.b Nov 30 '22 at 15:32
Ok, if vare is a list of data frames and influential is a list of vectors, the code should work in the context of a loop. – Santiago Capobianco Nov 30 '22 at 15:40
Or, if only influential is a list, you can unlist it into a vector and remove duplicates first. – Santiago Capobianco Nov 30 '22 at 15:43
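
For the first case (a list of data frames with a matching list of index vectors), a minimal sketch could pair them up with Map() instead of an explicit loop. The inputs below are made up for illustration; df_list and its element names are hypothetical.

```r
# Toy inputs (hypothetical): two copies of a data frame and one index vector each
df <- data.frame(x = 1:6, y = c(1, 2, 3, 4, 5, 60))
df_list <- list(a = df, b = df)
influential <- list(a = 6L, b = integer(0))

# Drop the flagged rows from each data frame; %in% also handles empty index vectors
cleaned <- Map(
  function(d, idx) d[!(seq_len(nrow(d)) %in% idx), , drop = FALSE],
  df_list, influential
)

sapply(cleaned, nrow)
```

This gives one cleaned data frame per regression, so each model can be refitted on its own subset rather than on a single data frame with the union of all flagged rows removed.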
