glm - outlier detection and removal in R

Question

I fitted a binary logistic regression model. The response variable is binary, and there are 4 regressors: 2 binary and 2 integer-valued. I want to find the outliers and delete them. For this I have created some plots:

  par(mfrow = c(2,2))
  plot(hat.ep,rstudent.ep,col="#E69F00", main="hat-values versus studentized residuals",
       xlab="Hat value", ylab="Studentized residual")
  dffits.ep <- dffits(model_logit)
  plot(id,dffits.ep,type="l", col="#E69F00", main="Index Plot",
       xlab="Identification", ylab="DFFITS")
  cov.ep <- covratio(model_logit)
  plot(id,cov.ep,type="l",col="#E69F00",  main="Covariance Ratio",
       xlab="Identification", ylab="Covariance Ratio")
  cook.ep <- cooks.distance(model_logit)
  plot(id,cook.ep,type="l",col="#E69F00", main="Cook's Distance",
       xlab="Identification", ylab="Cook's Distance")

According to the plots there is an outlier. How can I identify which observation is the outlier?

I have tried :

>   outlierTest(model_logit)
No Studentized residuals with Bonferonni p < 0.05
Largest |rstudent|:
     rstudent unadjusted p-value Bonferonni p
1061 1.931043           0.053478           NA

Are there some other functions for outlier detection?

You may find the `identify` function useful. – Apr 27 '18 at 10:56

mnm · Answer 1 · 2018-07-03T13:18:21.660

This answer comes quite late, so you may already have found a solution. In the absence of a minimal reproducible example, I'll attempt to answer the question using some dummy data and two custom functions. For a continuous variable, a common rule flags as outliers those observations that lie more than 1.5*IQR below the first quartile or above the third quartile, where IQR, the ‘interquartile range’, is the difference between the 75th and 25th percentiles. I also recommend that you see this post, which contains far better solutions than my crude answer.

> df <- data.frame(X = c(NA, rnorm(1000), runif(20, -20, 20)), Y = c(runif(1000),rnorm(20, 2), NA), Z = c(rnorm(1000, 1), NA, runif(20)))
> head(df)
         X      Y      Z
1       NA 0.8651 0.2784
2 -0.06838 0.4700 2.0483
3 -0.18734 0.9887 1.8353
4 -0.05015 0.7731 2.4464
5  0.25010 0.9941 1.3979
6 -0.26664 0.6778 1.1277

> boxplot(df$Y) # notice the outliers above the top whisker
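
The boxplot only shows the flagged points. To get their positions, the 1.5*IQR rule described above can be sketched as a small helper. This is a minimal sketch, not a definitive implementation: `iqr_outliers` and its argument `k` are hypothetical names of my own, and the quartiles come from `quantile()` with its default settings, so the fences can differ slightly from the hinge-based ones that `boxplot` draws.

```r
# Hypothetical helper: indices of values beyond k * IQR from the quartiles.
iqr_outliers <- function(x, k = 1.5) {
  # first and third quartile, ignoring missing values
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- k * (q[2] - q[1])
  # which() silently drops the NA comparisons
  which(x < q[1] - fence | x > q[2] + fence)
}

x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)
out_idx <- iqr_outliers(x)
x[out_idx]  # the flagged values
```

Applied to a data frame, something like `lapply(df, iqr_outliers)` would give one index vector per column.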

Now I'll create two custom functions. The first detects outliers, here defined as values lying more than `cutoff` standard deviations from the column mean; the second replaces the outlier values with NA.

# this function returns, for each column, the indices of the outlier values
> findOutlier <- function(data, cutoff = 3) {
  ## Calculate the mean and sd of each column
  means <- sapply(data, mean, na.rm = TRUE)
  sds <- sapply(data, sd, na.rm = TRUE)
  ## Identify the cells lying more than cutoff * sd from the column mean
  result <- mapply(function(d, m, s) {
    which(abs(d - m) > cutoff * s)
  }, data, means, sds, SIMPLIFY = FALSE)
  result
}

# check for outliers
> outliers <- findOutlier(df)

# custom function to remove outliers
> removeOutlier <- function(data, outliers) {
  result <- mapply(function(d, o) {
    res <- d
    res[o] <- NA
    return(res)
  }, data, outliers)
  return(as.data.frame(result))
}

> filterData<- removeOutlier(df, outliers)
> boxplot(filterData$Y)
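
For the fitted glm in the question, the influence measures that were plotted can also be queried directly to see which row stands out. The following is only a sketch on simulated data: the data frame `dat`, its column names, and the 4/n cutoff for Cook's distance are my assumptions (4/n is a commonly quoted rule of thumb, not a formal test), so substitute your own model and threshold.

```r
set.seed(1)
n <- 200
# Simulated stand-in for the question's data: 2 binary and 2 integer regressors
dat <- data.frame(b1 = rbinom(n, 1, 0.5), b2 = rbinom(n, 1, 0.3),
                  i1 = rpois(n, 3),       i2 = rpois(n, 5))
dat$y <- rbinom(n, 1, plogis(-1 + 0.8 * dat$b1 - 0.5 * dat$b2 +
                             0.3 * dat$i1 - 0.1 * dat$i2))
fit <- glm(y ~ b1 + b2 + i1 + i2, family = binomial, data = dat)

cook <- cooks.distance(fit)
worst <- which.max(cook)          # row with the largest Cook's distance
flagged <- which(cook > 4 / n)    # rows above the rule-of-thumb cutoff
head(sort(cook, decreasing = TRUE))

# Refit without the flagged rows and compare the coefficients
keep <- cook <= 4 / n
fit2 <- glm(y ~ b1 + b2 + i1 + i2, family = binomial, data = dat[keep, ])
cbind(all = coef(fit), trimmed = coef(fit2))
```

If the coefficients barely move after dropping the flagged rows, deleting them is probably not worth it; a point with high influence is also not necessarily an error in the data.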
