
I have a data frame df and I tried to replace the outliers (-888.88) with NAs, but my code didn't work and I couldn't find the mistake. I hope someone can help.

df <- structure(list(Tsoil_11 = c(5.25, 5.27, 5.25), Tsoil_12 = c(5.53, 5.55, 5.55), Tsoil_13 = c(5.45, 5.47, -888.88), Tsoil_14 = c(5.36, 5.34, -888.88), Tsoil_21 = c(5.7, 5.72, 5.7), Tsoil_22 = c(5.83, 5.81, -888.88)), row.names = c(NA, 3L), class = "data.frame")

df %>% mutate_if(is.numeric, ~replace(., . %in% boxplot.stats(.)$out, NA))
LEE
  • There are no outliers in your data. Also, I couldn't see a -888.88 value in your data. – Ronak Shah Feb 17 '21 at 12:29
  • @RonakShah Sorry for the stupid error, I just copied the wrong data.... I have edited my question. – LEE Feb 17 '21 at 12:58
  • It is risky to edit the `structure` call manually, as evidenced here: you have a mismatched paren. The fix might be simple, but it may be better to just copy the output from `dput` (subsetting data beforehand) than trying to create/edit by hand. – r2evans Feb 17 '21 at 13:10
  • And unfortunately, `boxplot.stats(df$Tsoil_14)$out` returns `numeric(0)`. It might be better to provide sample data programmatically instead of a subset that is too small to reproduce the statistical properties you desire. – r2evans Feb 17 '21 at 13:11
  • @r2evans yeah, that's so weird, I also checked my initial whole dataset, but when I use "boxplot.stats(df$Tsoil_14)$out", I found it returned my normal data instead of the outliers. – LEE Feb 17 '21 at 13:23
  • I have tried your code with randomly generated data of at least 20 rows, and it works, including the NAs. What is the issue at hand? – Pierre Chevallier Feb 17 '21 at 13:24
  • How many records do you have in your initial dataset? – Pierre Chevallier Feb 17 '21 at 13:26
  • @PierreChevallier I tried to replace all the outliers by using NAs. But the problem is that doesn't work for me now. I am puzzled. – LEE Feb 17 '21 at 13:27
  • @PierreChevallier I have tons of data records, that's why I subset them and change some values manually. – LEE Feb 17 '21 at 13:30
  • The issue here is that with just 3 records, you won't have outliers. We would need more records to get any outliers from the sample you provided. – Pierre Chevallier Feb 17 '21 at 13:32

1 Answer

The reason it's not being filtered out is that, statistically, it is not an outlier. Intuitively one might call it one, since two values are around 5 and the third is -888, but boxplot.stats defines an outlier as

     out: the values of any data points which lie beyond the extremes
          of the whiskers ('if(do.out)').

Defined in ?boxplot, whiskers are informed by

   range: this determines how far the plot whiskers extend out from the
          box.  If 'range' is positive, the whiskers extend to the most
          extreme data point which is no more than 'range' times the
          interquartile range from the box. A value of zero causes the
          whiskers to extend to the data extremes.

Since range defaults to 1.5 (the corresponding argument of boxplot.stats is coef, which also defaults to 1.5), each whisker can extend at most 1.5 times the IQR (interquartile range) beyond the quartiles. You can follow along with the source for boxplot.stats (output truncated slightly for presentation; each step is advanced with n):

debugonce(boxplot.stats)

boxplot.stats(df$Tsoil_14)
# debugging in: boxplot.stats(df$Tsoil_14)

# Browse[2]> 
debug: if (coef < 0) stop("'coef' must not be negative")
# Browse[2]> 
debug: nna <- !is.na(x)
# Browse[2]> 
debug: n <- sum(nna)
# Browse[2]> 
debug: stats <- stats::fivenum(x, na.rm = TRUE)
# Browse[2]> 
debug: iqr <- diff(stats[c(2, 4)])
# Browse[2]> 
debug: if (coef == 0) do.out <- FALSE else { ...
# Browse[2]> 
debug: out <- if (!is.na(iqr)) {
#     x < (stats[2L] - coef * iqr) | x > (stats[4L] + coef * iqr)
# } else !is.finite(x)
# Browse[2]> 
debug: x < (stats[2L] - coef * iqr) | x > (stats[4L] + coef * iqr)
# Browse[2]> 
stats
# [1] -888.88 -441.77    5.34    5.35    5.36
# Browse[2]> 
iqr
# [1] 447.12

# Browse[2]> debug: x < (stats[2L] - coef * iqr) | x > (stats[4L] + coef * iqr)
# Browse[2]> 
out
# [1] FALSE FALSE FALSE

Okay, so stats includes our three values plus two interpolated values, and that's where the problem occurs for you: the 1.5*iqr for the lower whisker is measured from -441.77, so lower outliers are defined as being less than

# Browse[2]> 
(stats[2L] - coef * iqr)
# [1] -1112.45

Since -888.88 is greater than -1112.45, it is not flagged. This is not a problem with the statistics: the result is a consistent application of the definition. It is just that "quartiles" (25%, 50%, 75%) of only three samples are an imperfect lens.
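The same arithmetic can be reproduced without the debugger. This is a minimal sketch that mirrors the steps shown in the trace above (fivenum, the IQR between the hinges, and the two fences), using the Tsoil_14 values from the question; the variable names are only illustrative.

```r
# Tsoil_14 from the question: two plausible readings and one sentinel value
x <- c(5.36, 5.34, -888.88)

# Tukey's five-number summary: min, lower hinge, median, upper hinge, max
stats <- fivenum(x)
iqr <- diff(stats[c(2, 4)])   # upper hinge minus lower hinge

coef <- 1.5                   # default 'coef' of boxplot.stats()
lower_fence <- stats[2] - coef * iqr
upper_fence <- stats[4] + coef * iqr

# A value is flagged only if it falls outside [lower_fence, upper_fence]
flagged <- x < lower_fence | x > upper_fence

print(stats)
print(c(lower = lower_fence, upper = upper_fence))
print(flagged)
```

With only three values, the lower hinge is the midpoint between -888.88 and 5.34, so the suspect value itself stretches the IQR and pushes the fence past it.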

This problem goes away when you have more data:

rbind(df, df[1:2,]) %>%
  mutate_if(is.numeric, ~replace(., . %in% boxplot.stats(.)$out, NA))
#   Tsoil_11 Tsoil_12 Tsoil_13 Tsoil_14 Tsoil_21 Tsoil_22
# 1     5.25     5.53     5.45     5.36     5.70     5.83
# 2     5.27     5.55     5.47     5.34     5.72     5.81
# 3     5.25     5.55       NA       NA     5.70       NA
# 4     5.25     5.53     5.45     5.36     5.70     5.83
# 5     5.27     5.55     5.47     5.34     5.72     5.81

though that is obviously not the way to resolve this issue.
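To see why the two extra rows change the result, it may help to compare the five-number summaries directly. This is a small sketch on the Tsoil_14 column only; x5 is assumed to match what rbind(df, df[1:2,]) produces for that column.

```r
x3 <- c(5.36, 5.34, -888.88)   # the original three rows
x5 <- c(x3, 5.36, 5.34)        # the same column after appending rows 1 and 2

# With three values the lower hinge is interpolated toward -888.88;
# with five values the hinges are both ordinary readings.
print(fivenum(x3))
print(fivenum(x5))

out3 <- boxplot.stats(x3)$out  # values flagged with three rows
out5 <- boxplot.stats(x5)$out  # values flagged with five rows
print(out3)
print(out5)
```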

Your options include:

  • Get more data. As you saw above, adding just two more rows (with consistent-looking values) might resolve it. (Note that rbind(df, df[2:3,]) %>% ... does not resolve it, but to me that is a feature: if more data continues to show values like that, perhaps they are not true outliers.)

  • Filter first. Observed data often has a range of physically possible values, and anything outside that range cannot be real. The range is informed by the physics or chemistry (or whatever) of the problem, and is not something I can determine for you here. For example, if this were pH, we would know its possible range of values, and anything outside it could trivially be replaced:

    ... %>% 
      mutate(pH = if_else(pH < 0, NA_real_, pH))
    
  • Outlier determination. There are theses and dissertations that research different methods of outlier-detection. There are R packages that employ some of these alternative methods. I won't speak for or against any of them, they often have their own merit (and limitations).
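As a rough illustration of the "filter first" option applied to data shaped like the question's, the sketch below replaces anything outside a plausible range with NA. It uses base R so it runs without extra packages. The bounds valid_min and valid_max are hypothetical, chosen only for illustration; the real limits have to come from knowledge of the sensor and the site.

```r
# A subset of the question's columns, rebuilt by hand for the example
df <- data.frame(
  Tsoil_13 = c(5.45, 5.47, -888.88),
  Tsoil_14 = c(5.36, 5.34, -888.88),
  Tsoil_21 = c(5.70, 5.72, 5.70)
)

# Hypothetical plausible range for these readings (an assumption, not from the post)
valid_min <- -50
valid_max <- 60

# Apply the range check to every numeric column
num_cols <- vapply(df, is.numeric, logical(1))
df[num_cols] <- lapply(df[num_cols], function(x) {
  replace(x, x < valid_min | x > valid_max, NA)
})

print(df)
```

The same function body should fit inside mutate(across(where(is.numeric), ...)) if you prefer to stay in a dplyr pipeline.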


FYI, mutate_if has been superseded by across. The equivalent call would be

df %>%
   mutate(across(where(is.numeric),
                 ~ replace(., . %in% boxplot.stats(.)$out, NA)))

Another FYI: testing for equality with floating point is usually fine, but there is no guarantee that it will always work (see "Why are these numbers not equal?", "Is floating point math broken?", and https://en.wikipedia.org/wiki/IEEE_754). It's often better to use tests of strict inequality, optionally with a tolerance.
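A short sketch of what that looks like in practice: the first part shows the classic case where exact equality fails, and the second matches a sentinel value within a tolerance instead of with == or %in%. The names sentinel and tol are illustrative, and the tolerance value is an assumption you would tune to your data's precision.

```r
# Exact equality can fail for values that look identical when printed
a <- 0.1 + 0.2
print(a == 0.3)                    # exact comparison
print(isTRUE(all.equal(a, 0.3)))   # comparison within all.equal's default tolerance

# Matching a sentinel value with a tolerance instead of ==
x <- c(5.36, 5.34, -888.88)
sentinel <- -888.88
tol <- 1e-6                        # hypothetical tolerance
x_clean <- replace(x, abs(x - sentinel) < tol, NA)
print(x_clean)
```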

r2evans
  • I feel like I have found the problem. Following your answer, I checked my data. The results showed that even with more data, my problem is still there. But you are right, the definition of the outliers is the problem. So filtering first really helped. – LEE Feb 17 '21 at 14:57