I am trying to remove outliers from my main data frame after calculating Cook's distance for multiple linear regressions. However, after identifying the influential values, I don't know how to remove them from the data frame.

Here is a reproducible example of what I would like to achieve:

data(varechem, package = "vegan")
data(varespec, package = "vegan")
vare <- cbind(varespec, varechem)

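# The result lists must exist before the loop assigns into them
mod_cook <- list()
cook_distance <- list()
influential <- list()
names_influencial <- list()

for (j in c("Callvulg", "Empenigr")){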
  for (soil in c("N", "K",  "S", "P", "Fe")){
    mod_name <- paste( soil, j, sep="_")
    cook_name <- paste( soil, j, sep="_") 
    name <- paste( soil, j, sep="_")
    outliers <- paste( soil, j, sep="_")
    mod_cook[[mod_name]] <- lm(vare[,j] ~ vare[,soil], na.action = na.exclude)
    cook_distance[[cook_name]] <- cooks.distance(mod_cook[[mod_name]])
    sample_size <- nrow(vare)
    influential[[name]] <- which(cook_distance[[cook_name]] > 4/sample_size)
    names_influencial[[outliers]] <- names(influential[[name]])
    
  }
}

Now I have the influential values, which I assume are row numbers, and I want to delete these rows from my vare data frame. Is there a way to achieve this last step? Thanks a lot!

i.b
  • It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input that can be used to test and verify possible solutions. – MrFlick Nov 30 '22 at 14:48
  • The solution is not to delete outliers because they do not fit your model! Change your model instead of changing your data. – user2974951 Nov 30 '22 at 15:04
  • I understand your point, but I would still like to know how to remove these rows from the main data frame, because I was not able to find a coding solution when using a for loop with multiple regressions. – i.b Nov 30 '22 at 15:21

1 Answer

Assuming that influential is the vector of indices of the observations you want to remove, and vare is a data.frame, then you can do:

vare <- vare[-influential,]
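
One caveat with negative indexing: if `influential` happens to be empty (no observation exceeds the cutoff), `vare[-influential, ]` returns zero rows instead of the full data frame. Below is a minimal sketch of a guard using `setdiff()`. The toy data frame `df` and the index vectors `idx` and `idx_none` are illustrative names, not taken from the question.

```r
# Toy data standing in for vare (illustrative only)
df <- data.frame(x = 1:5, y = c(2, 4, 6, 8, 10))

# Case 1: some rows are flagged for removal
idx <- c(2L, 4L)
keep <- setdiff(seq_len(nrow(df)), idx)
df_some <- df[keep, , drop = FALSE]

# Case 2: nothing is flagged; df[-integer(0), ] would drop every row,
# whereas setdiff() leaves all row positions in place
idx_none <- integer(0)
keep_all <- setdiff(seq_len(nrow(df)), idx_none)
df_none <- df[keep_all, , drop = FALSE]
```

Selecting the rows to keep, rather than negating the rows to drop, behaves the same whether or not any indices were found.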

Edit:

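If influential is a list of index vectors (one per model, as in your loop), combine them first. Note that this removes every row flagged by at least one of the models: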

influential <- unique(unlist(influential))
vare <- vare[-influential,]
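
For reference, here is a self-contained sketch of the whole workflow, from fitting the models to dropping the flagged rows. It uses the built-in `mtcars` data as a stand-in for `vare` so that it runs without extra packages. The column choices and the names `dat`, `drop_rows`, and `dat_clean` are assumptions made for illustration.

```r
# mtcars stands in for vare; the columns below are illustrative choices
dat <- mtcars
responses  <- c("mpg", "qsec")
predictors <- c("wt", "hp")

influential <- list()
for (resp in responses) {
  for (pred in predictors) {
    # Build the formula resp ~ pred and fit one simple regression
    fit <- lm(reformulate(pred, response = resp), data = dat)
    cd  <- cooks.distance(fit)
    # 4/n rule of thumb, as in the question
    influential[[paste(pred, resp, sep = "_")]] <- which(cd > 4 / nrow(dat))
  }
}

# Row positions flagged by at least one model
drop_rows <- unique(unlist(influential, use.names = FALSE))

# Keep the remaining rows; setdiff() also handles an empty drop_rows
dat_clean <- dat[setdiff(seq_len(nrow(dat)), drop_rows), , drop = FALSE]
```

Because `which()` returns row positions, the indices line up with the rows of the data frame that was passed to `lm()`.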

Edit 2: replaced ! with -, oops

Hope it helps