The reason it's not being filtered out is that, statistically, it is not an outlier. Intuitively one might expect it to be, since two values are around 5 and the third is -888, but boxplot.stats defines an outlier as
out: the values of any data points which lie beyond the extremes
of the whiskers ('if(do.out)').
As documented in ?boxplot, the whiskers are controlled by
range: this determines how far the plot whiskers extend out from the
box. If 'range' is positive, the whiskers extend to the most
extreme data point which is no more than 'range' times the
interquartile range from the box. A value of zero causes the
whiskers to extend to the data extremes.
Since range defaults to 1.5 (the corresponding argument of boxplot.stats is coef, with the same default), each whisker reaches to the most extreme data point that lies within 1.5 times the IQR (interquartile range) of the box, measured from the quartiles. You can follow along with the source for boxplot.stats (output truncated slightly for presentation; each step is advanced with n):
debugonce(boxplot.stats)
boxplot.stats(df$Tsoil_14)
# debugging in: boxplot.stats(df$Tsoil_14)
# Browse[2]>
debug: if (coef < 0) stop("'coef' must not be negative")
# Browse[2]>
debug: nna <- !is.na(x)
# Browse[2]>
debug: n <- sum(nna)
# Browse[2]>
debug: stats <- stats::fivenum(x, na.rm = TRUE)
# Browse[2]>
debug: iqr <- diff(stats[c(2, 4)])
# Browse[2]>
debug: if (coef == 0) do.out <- FALSE else { ...
# Browse[2]>
debug: out <- if (!is.na(iqr)) {
# x < (stats[2L] - coef * iqr) | x > (stats[4L] + coef * iqr)
# } else !is.finite(x)
# Browse[2]>
debug: x < (stats[2L] - coef * iqr) | x > (stats[4L] + coef * iqr)
# Browse[2]>
stats
# [1] -888.88 -441.77 5.34 5.35 5.36
# Browse[2]>
iqr
# [1] 447.12
# Browse[2]> debug: x < (stats[2L] - coef * iqr) | x > (stats[4L] + coef * iqr)
# Browse[2]>
out
# [1] FALSE FALSE FALSE
Okay, so stats includes our three values plus two interpolated values, and that's where the problem arises for you: the 1.5*iqr for the lower whisker is measured from -441.77, so lower outliers are defined as being less than
# Browse[2]>
(stats[2L] - coef * iqr)
# [1] -1112.45
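The same arithmetic can be reproduced outside the debugger. This is a minimal sketch that assumes the Tsoil_14 column holds the three values visible in the trace above; the names x, lower, and upper are mine, not part of boxplot.stats.

```r
# Tsoil_14 as it appears in the debug trace (assumed to be the full column)
x <- c(5.36, 5.34, -888.88)

stats <- stats::fivenum(x)      # min, lower hinge, median, upper hinge, max
iqr   <- diff(stats[c(2, 4)])   # hinge-to-hinge spread
coef  <- 1.5                    # default used by boxplot.stats

lower <- stats[2] - coef * iqr  # lower fence
upper <- stats[4] + coef * iqr  # upper fence

# Same logical test that boxplot.stats evaluates internally
is_out <- x < lower | x > upper
print(c(lower = lower, upper = upper))
print(is_out)

# Cross-check against the real function: no outliers are reported
print(boxplot.stats(x)$out)
```

With only three points, the lower hinge is the midpoint between -888.88 and 5.34, so the suspicious value itself drags the fence far below it.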
This is not a problem with the statistics: it is a consistent interpretation, but "quartiles" (25%, 50%, 75%) of just three samples are an imperfect lens.
This problem goes away when you have more data:
rbind(df, df[1:2,]) %>%
mutate_if(is.numeric, ~replace(., . %in% boxplot.stats(.)$out, NA))
# Tsoil_11 Tsoil_12 Tsoil_13 Tsoil_14 Tsoil_21 Tsoil_22
# 1 5.25 5.53 5.45 5.36 5.70 5.83
# 2 5.27 5.55 5.47 5.34 5.72 5.81
# 3 5.25 5.55 NA NA 5.70 NA
# 4 5.25 5.53 5.45 5.36 5.70 5.83
# 5 5.27 5.55 5.47 5.34 5.72 5.81
though that is obviously not the way to resolve this issue.
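To see why the extra rows change the outcome, here is a small sketch on the Tsoil_14 values alone (again assuming the three values from the debug trace; the vector names are hypothetical). It compares appending two ordinary readings with appending rows that repeat the suspicious value.

```r
x3 <- c(5.36, 5.34, -888.88)

# Equivalent of rbind(df, df[1:2, ]) for this column: two more ordinary readings
x_more_good <- c(x3, x3[1:2])
print(fivenum(x_more_good))             # hinges are now both close to 5.3
print(boxplot.stats(x_more_good)$out)   # -888.88 is flagged

# Equivalent of rbind(df, df[2:3, ]): the extra rows repeat the suspicious value
x_more_bad <- c(x3, x3[2:3])
print(fivenum(x_more_bad))              # lower hinge is -888.88 itself
print(boxplot.stats(x_more_bad)$out)    # nothing is flagged
```

In the first case the box collapses to a width of about 0.02, so anything far from 5.3 falls outside the fences; in the second the box itself spans both clusters.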
Your options include:
Get more data. As you saw above, just two more rows (with consistent-looking values) can be enough for the outlier to be detected. Note that rbind(df, df[2:3,]) %>% ... does not resolve it, but that to me is a feature: if more data continues to show values like that, perhaps they are not true outliers.
Filter first. Observed data often has a range of possible values, outside of which a measurement is physically impossible. That range is informed by the physics or chemistry (or whatever) of the problem, and it is not something I can tell you here. For example, if this were pH, we would know its possible range of values, and anything outside of that range can trivially be replaced. With pH:
... %>%
mutate(pH = if_else(pH < 0, NA_real_, pH))
Outlier determination. There are theses and dissertations that research different methods of outlier detection, and there are R packages that implement some of these alternative methods. I won't speak for or against any of them; they each have their own merits (and limitations).
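As a rough sketch of the "filter first" option applied to all numeric columns at once: the bounds lo and hi below are placeholders I made up for illustration, not real limits for soil temperature, and df_sub contains only the columns whose values are fully visible in the output above.

```r
library(dplyr)

# Subset of the question's data: only columns fully shown in the printed output
df_sub <- data.frame(
  Tsoil_11 = c(5.25, 5.27, 5.25),
  Tsoil_12 = c(5.53, 5.55, 5.55),
  Tsoil_14 = c(5.36, 5.34, -888.88),
  Tsoil_21 = c(5.70, 5.72, 5.70)
)

# Hypothetical plausible range; replace with limits justified by your domain
lo <- -50
hi <- 60

cleaned <- df_sub %>%
  mutate(across(where(is.numeric),
                ~ if_else(between(.x, lo, hi), .x, NA_real_)))

print(cleaned)
```

Unlike the boxplot rule, this does not depend on how many rows there are, so it behaves the same with three observations as with three thousand.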
FYI, mutate_if has been superseded by across. The equivalent call would be
df %>%
mutate(across(where(is.numeric),
~ replace(., . %in% boxplot.stats(.)$out, NA)))
Another FYI: testing for equality with floating point, which is what %in% does here, is usually fine, but there is no guarantee that it will always work (see "Why are these numbers not equal?", "Is floating point math broken?", and https://en.wikipedia.org/wiki/IEEE_754). It's often better to use tests of strict inequality, optionally with a tolerance.
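A short illustration of that caveat, followed by one possible way to avoid the equality test altogether. The helper replace_outliers is a hypothetical name of mine; it is a sketch that applies the same inequality shown in the debug trace directly, rather than matching values against $out.

```r
# Exact comparison can fail for values that look equal when printed
a <- 0.1 + 0.2
b <- 0.3
print(a == b)                    # FALSE
print(a %in% b)                  # FALSE, %in% also matches exactly
print(isTRUE(all.equal(a, b)))   # TRUE, comparison with a tolerance
print(abs(a - b) < 1e-8)         # TRUE, explicit tolerance

# Hypothetical helper: flag outliers by strict inequality against the fences
replace_outliers <- function(x, coef = 1.5) {
  stats <- stats::fivenum(x, na.rm = TRUE)
  iqr   <- diff(stats[c(2, 4)])
  out   <- x < (stats[2] - coef * iqr) | x > (stats[4] + coef * iqr)
  replace(x, which(out), NA)     # which() ignores NA comparisons
}

res <- replace_outliers(c(5.36, 5.34, -888.88, 5.36, 5.34))
print(res)
```

It can be dropped into the across() call above in place of the replace(., . %in% boxplot.stats(.)$out, NA) lambda, for example as across(where(is.numeric), replace_outliers).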