
I am struggling with some strange behaviour in R, with the quantile function.

I have two sets of numeric data, and a custom boxplot stats function (which someone helped me write, so I am actually not too sure about every detail):

sample_lang = c(91, 122,  65,  90,  90, 102,
            98,  94,  84,  86, 108, 104,
            94, 110, 100,  86,  92,  92,
            124, 108,  82,  65, 102,  90, 114,
            88,  68, 112,  96,  84,  92,
            80, 104, 114, 112, 108,  68,
            92,  68,  63, 112, 116)

sample_vocab = c(96, 136,  81,  92,  95,
                 112, 101,  95,  97,  94,
                 117,  95, 111, 115,  88,
                 92, 108,  81, 130, 106,  
                 91,  95, 119, 103, 132, 103,
                 65, 114, 107, 108,  86, 
                 100,  98, 111, 123, 123, 117,
                 82, 100,  97,  89, 132, 114)

my.boxplot.stats <- function (x, coef = 1.5, do.conf = TRUE, do.out = TRUE) {
  if (coef < 0) 
    stop("'coef' must not be negative")
  nna <- !is.na(x)
  n <- sum(nna)
  #stats <- stats::fivenum(x, na.rm = TRUE)
  stats <- quantile(x, probs = c(0.15, 0.25, 0.5, 0.75, 0.85), na.rm = TRUE)
  iqr <- diff(stats[c(2, 4)])
  if (coef == 0) 
    do.out <- FALSE
  else {
    out <- if (!is.na(iqr)) {
      x < (stats[2L] - coef * iqr) | x > (stats[4L] + coef * iqr)
    }
    else !is.finite(x)
    if (any(out[nna], na.rm = TRUE)) 
      stats[c(1, 5)] <- range(x[!out], na.rm = TRUE)
  }
  conf <- if (do.conf) 
    stats[3L] + c(-1.58, 1.58) * iqr/sqrt(n)
  list(stats = stats, n = n, conf = conf,
       out = if (do.out) x[out & nna] else numeric())
}

However, when I call quantile and my.boxplot.stats on the same set of data, I get different quantile results for the sample_vocab data (the results are consistent for the sample_lang data), and I am not sure why:

> quantile(sample_vocab, probs = c(0.15, 0.25, 0.5, 0.75, 0.85), na.rm=TRUE)
  15%   25%   50%   75%   85% 
 89.6  94.5 101.0 114.0 118.4 
> 
> my.boxplot.stats(sample_vocab)
$stats
  15%   25%   50%   75%   85% 
 81.0  94.5 101.0 114.0 136.0 

Could someone help me understand what is happening? Please note that I am reasonably experienced with programming but have no formal training in R; I am learning on my own.

Thanks so much in advance!

Terrence J
  • Well, quantile does what the documentation says. Your function calls quantile, but then apparently `if (any(out[nna], na.rm = TRUE))` gets triggered and so the next line `stats[c(1, 5)] <- range(x[!out], na.rm = TRUE)` modifies the first and last values of `stats`, which are the values where you see differences. That's what is happening. Whether the boxplot stats code is correct or not (or what it's trying to do) isn't very clear. – Gregor Thomas Jul 08 '15 at 06:12
  • @Gregor - thanks, I think you are right. Do you know what `if (any(out[nna], na.rm = TRUE))` is actually doing? It seems like my sample_vocab is triggering this, but sample_lang is not. I am having trouble understanding the syntax. – Terrence J Jul 08 '15 at 06:47

1 Answer


The relevant bit of code is right here:

  if (coef == 0) 
    do.out <- FALSE
  else {
    out <- if (!is.na(iqr)) {
      x < (stats[2L] - coef * iqr) | x > (stats[4L] + coef * iqr)
    }
    else !is.finite(x)
    if (any(out[nna], na.rm = TRUE)) 
      stats[c(1, 5)] <- range(x[!out], na.rm = TRUE)
  }

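Basically, if coef != 0 (in your case coef is 1.5, the default function parameter) and at least one data point lies more than coef * iqr below the 25% quantile or above the 75% quantile, then the first and last elements of the reported quantiles are replaced with the minimum and maximum data values that lie within coef * iqr of the 25% and 75% quantiles, where iqr is the distance between those quantiles. If no data point lies outside that range, the quantiles are returned unchanged.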
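To see which of the two samples trips that condition, the fence computation can be reproduced on its own. The following is only a sketch: `fence_check` is a hypothetical helper introduced here for illustration (it is not part of the question's code or of base R), and it assumes the default quantile type, as `my.boxplot.stats` does.

```r
# Data copied from the question.
sample_lang <- c(91, 122, 65, 90, 90, 102, 98, 94, 84, 86, 108, 104,
                 94, 110, 100, 86, 92, 92, 124, 108, 82, 65, 102, 90, 114,
                 88, 68, 112, 96, 84, 92, 80, 104, 114, 112, 108, 68,
                 92, 68, 63, 112, 116)
sample_vocab <- c(96, 136, 81, 92, 95, 112, 101, 95, 97, 94,
                  117, 95, 111, 115, 88, 92, 108, 81, 130, 106,
                  91, 95, 119, 103, 132, 103, 65, 114, 107, 108, 86,
                  100, 98, 111, 123, 123, 117, 82, 100, 97, 89, 132, 114)

# Hypothetical helper: computes the same fences as my.boxplot.stats and
# reports which points fall outside them.
fence_check <- function(x, coef = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- diff(q)                      # distance between the 25% and 75% quantiles
  lower <- q[1] - coef * iqr          # lower fence
  upper <- q[2] + coef * iqr          # upper fence
  out <- x < lower | x > upper        # TRUE for points outside the fences
  list(lower = lower, upper = upper,
       outliers = x[out & !is.na(x)],
       triggered = any(out, na.rm = TRUE))
}

fence_check(sample_lang)   # fences 53 and 141: nothing outside, stats untouched
fence_check(sample_vocab)  # fences 65.25 and 143.25: 65 is outside, stats[c(1, 5)] overwritten
```

For sample_vocab the single value 65 lies just below the lower fence, so the 15% and 85% entries are replaced by the range of the remaining data, 81 and 136. In the original `boxplot.stats`, the first and last entries come from `fivenum` and are already the minimum and maximum, so this replacement only ever trims outliers; with `quantile` supplying 15% and 85% instead, the same line switches between two different definitions depending on whether an outlier exists.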

josliber
  • Thanks for your help. But then why is the reported 85% for sample_lang not 124? If I am understanding correctly, the iqr is 22, so 108 + 1.5*22 = 141... – Terrence J Jul 08 '15 at 06:32
  • I reviewed further where I got the help from: http://stackoverflow.com/questions/29070763/how-to-create-a-boxplot-in-r-with-box-representing-the-15th-and-85th-percentile - the majority of the code is in fact unchanged from the original boxplot.stats function, so perhaps there is a rule for when the 1st and 5th values are changed to the ones you mentioned. – Terrence J Jul 08 '15 at 06:59
  • The condition `if (any(out[nna], na.rm = TRUE))` seems to be what changes those values. I am not sure exactly what this condition means, but it seems that sample_lang does not meet the condition while sample_vocab does. – Terrence J Jul 08 '15 at 07:45
  • Sorry, but I don't understand. The sample data I provided above is an example, as I mentioned. The sample_lang quantiles are not changed, but the sample_vocab ones are. Noting my first comment to your answer, I don't understand why the quantiles for sample_lang are not changed, if what you are saying is true. – Terrence J Jul 09 '15 at 00:23
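
On the syntax asked about in the comments: `out` is a logical vector with one entry per data point, `nna` marks the non-missing points, and `any()` asks whether at least one of the selected entries is TRUE. A small sketch on toy data (the vector and thresholds below are made up for illustration, not taken from the question):

```r
x   <- c(10, 11, NA, 12, 50)   # toy data with one missing value
nna <- !is.na(x)               # TRUE where x is not missing
out <- x < 5 | x > 20          # TRUE where x is outside the limits; NA where x is NA

out[nna]                       # flags for the non-missing values only
any(out[nna], na.rm = TRUE)    # is at least one non-missing value flagged?

# When the condition holds, the function keeps only the unflagged values
# and takes their minimum and maximum:
range(x[!out], na.rm = TRUE)
```

Here only 50 is flagged, so the condition is TRUE and the range of the remaining values, 10 and 12, is what would be written into `stats[c(1, 5)]`. If no value were flagged, the `if` body would be skipped and `stats` would keep whatever `quantile` returned.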