I am trying to remove outliers from my main data frame after calculating Cook's distance for multiple linear regressions. However, once I have the influential values, I don't know how to remove the corresponding rows from the data frame.
Here is a reproducible example of what I would like to achieve:
data(varechem, package = "vegan")
data(varespec, package = "vegan")
vare <- cbind(varespec, varechem)
# Initialise the result lists first; assigning into an undefined object fails
mod_cook <- list()
cook_distance <- list()
influential <- list()
names_influencial <- list()
sample_size <- nrow(vare)  # constant, so computed once outside the loops
for (j in c("Callvulg", "Empenigr")) {
  for (soil in c("N", "K", "S", "P", "Fe")) {
    # One key per soil/species pair, e.g. "N_Callvulg"
    name <- paste(soil, j, sep = "_")
    mod_cook[[name]] <- lm(vare[, j] ~ vare[, soil], na.action = na.exclude)
    cook_distance[[name]] <- cooks.distance(mod_cook[[name]])
    # Observations whose Cook's distance exceeds the 4/n rule of thumb
    influential[[name]] <- which(cook_distance[[name]] > 4 / sample_size)
    names_influencial[[name]] <- names(influential[[name]])
  }
}
Now I have the influential values, which I assume are row numbers, and I want to delete these rows from my vare data frame. Is there a way to achieve this last step?
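To make the goal concrete, the last step might look something like the rough sketch below. It assumes that a row flagged in any of the models should be dropped, and that the indices returned by which() are row positions. The helper name drop_influential and the use of the built-in mtcars data are hypothetical and only there for illustration.

```r
# Hypothetical helper (not from any package): fit one simple regression per
# response/predictor pair and drop every row flagged as influential in any fit.
drop_influential <- function(df, responses, predictors) {
  threshold <- 4 / nrow(df)            # same 4/n rule of thumb as in the loop above
  flagged <- integer(0)
  for (resp in responses) {
    for (pred in predictors) {
      fit <- lm(df[[resp]] ~ df[[pred]], na.action = na.exclude)
      d <- cooks.distance(fit)         # one value per row of df when na.exclude is used
      flagged <- union(flagged, which(d > threshold))  # which() skips NA values
    }
  }
  # Guard: df[-integer(0), ] would return zero rows instead of all rows
  if (length(flagged) == 0) return(df)
  df[-flagged, , drop = FALSE]
}

# Illustration on a built-in data set (an assumption, not my real data)
cleaned <- drop_influential(mtcars,
                            responses = c("mpg", "qsec"),
                            predictors = c("wt", "hp"))
nrow(mtcars) - nrow(cleaned)   # number of rows removed
```

With my data, I would expect the equivalent call to be drop_influential(vare, c("Callvulg", "Empenigr"), c("N", "K", "S", "P", "Fe")), but I am not sure this is the right approach.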
Thanks a lot!