
A simple question about a seemingly innocent function: summary.

Until I saw Min and Max results that were outside the range of my data, I was unaware that summary has a digits argument specifying the precision of its output. My question is about how to address this in a clean, universal manner.

Here is an example of the issue:

set.seed(0)
vals <- 1 + 10 * 1:50000
df   <- cbind(rnorm(10000), sample(vals, 10000), runif(10000))  # NB: cbind() yields a matrix, not a data frame

Applying summary and range, we get the following output - notice the discrepancy in the range values versus the Min and Max:

    > apply(df, 2, summary)

                [,1]   [,2]      [,3]
    Min.    -3.703000     11 6.791e-05
    1st Qu. -0.668500 122800 2.498e-01
    Median   0.009778 248000 5.014e-01
    Mean     0.010450 248800 5.001e-01
    3rd Qu.  0.688800 374000 7.502e-01
    Max.     3.568000 499900 9.999e-01

    > apply(df, 2, range)
            [,1]   [,2]         [,3]
    [1,] -3.703236     11 6.790622e-05
    [2,]  3.568101 499931 9.998686e-01

Seeing erroneous ranges in summary is a little disconcerting, so I looked at the digits argument, but it is simply the standard significant-digits control for formatting output. Also note: every quantile other than Min shows a value that does not exist in the data set (this is why I put the 1 + in the definition of vals), nor would one see these quantiles in most standard quantile calculations, even allowing for differences in midpoint selection. (When I saw this in the original data, I wondered how I had lost a value of 1 from everything!)
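For what it's worth, the rounding appears to come from signif(): summary.default (at least in the R versions current when this was asked) passes its quantiles through signif(qq, digits), and here digits defaults to max(3, getOption("digits") - 3) = 4. A minimal sketch reproducing the surprising Max., using the vals from above:

```r
vals <- 1 + 10 * 1:50000          # every value is congruent to 1 (mod 10)

# summary's default digits works out to max(3, 7 - 3) = 4 significant
# digits, and the quantiles are passed through signif() at that precision:
signif(max(vals), 4)              # 500000, although max(vals) is 500001
signif(max(vals), 4) %in% vals    # FALSE: the reported "Max." is not in the data
```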

There is a difference between explicable computational behavior (i.e. formatting and precision) and statistically motivated expectations (such as values identified as quantiles actually lying within the range of the data set). Since we can't change the expectations, we need to change the behavior of the code, or at least improve on it.

The question: Is there a more appropriate way to set the output so that the reported range can be trusted, other than setting digits to a large value, e.g. digits = 16? And is 16 even the most appropriate universal default? Using 16 digits seems to be the best guarantee of precision for double floats, though the output does not actually show 16 digits (it still appears to be truncated to 8 or 9 digits).
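One data point on the "is 16 enough?" question: 17 significant decimal digits are enough to uniquely round-trip any double, so once digits is in that neighborhood, signif() typically hands values back unchanged. A small check (the specific value is just an illustration):

```r
x <- -3.703236        # an arbitrary double carrying 7 significant digits
signif(x, 16) == x    # TRUE: 16 significant digits already exceed
                      # the precision this particular value carries
```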


Update 1: As @BrianDiggs has noted, via the links, the behavior is documented, but unexpected. To clarify my issue, relative to the answers on the link provided by Brian (excepting the answer by Brian himself): it's not that the behavior is undocumented, but it's flatly wrong to denote as Min and Max values which are not Min and Max. A documented function that gives incorrect output in its default settings needs to be used with non-default settings (or should not be used). (Maybe one could argue whether "Min" and "Max" should be renamed as "Approximate Min" and "Approximate Max", but let's not go there.)

Update 2: As @DWin has noted, summary() takes as its default max(3, getOption("digits") - 3). I'd previously erred in saying the default was 3. What's interesting is that this implies two ways to set the behavior of the output. If we use both, the behavior gets weird:

> options(digits = 20)
> apply(df, 2, summary, digits = 10)

                             [,1]                  [,2]                      [,3]
Min.    -3.7032358429999998605808     11.00000000000000 6.7906221370000004927e-05
1st Qu. -0.6684710537000000396546 122798.50000000000000 2.4977348059999998631e-01
Median   0.0097783099960000001427 247971.00000000000000 5.0137970539999998643e-01
Mean     0.0104475229200000005458 248776.38699999998789 5.0011818200000002221e-01
3rd Qu.  0.6887842181000000119084 374031.00000000000000 7.5024240300000000214e-01
Max.     3.5681007909999999938577 499931.00000000000000 9.9986864070000003313e-01

Notice that this now has 20 digits of output, even though the argument passed specifies 10 digits of precision. If we set the global option for digits to be some "sane" value like 16, we still end up with issues if we provide summary with an argument of 10.
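My reading of what happens here (an interpretation, not gospel): apply() returns a plain numeric matrix, so summary's class and print method are dropped, and the matrix is printed with the global digits option - applied to values that were already rounded via signif(, 10). Printing a rounded value at 20 digits then exposes nothing but its binary representation:

```r
x <- signif(1/3, 10)        # rounded to 10 significant digits first...
format(x, digits = 20)      # ...then formatted at 20: the trailing digits
                            # are just the binary representation of the
                            # 10-digit rounded value, not real precision
format(1/3, digits = 20)    # compare: the unrounded value at 20 digits
```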

I believe the documentation is incomplete, and Brian Diggs has pointed out other issues with it in his thoughtful answer in the link to R-help.

Despite these wrinkles, the question remains open, though maybe it can't be answered. I suspect the best course is simply to leave the global digits option as-is (though I am a little disturbed by the implications of the behavior above) and instead pass a value of 16 to summary. It isn't immediately obvious where the output precision is specified: the interaction of four values - the global option (and the global option minus 3), the passed argument, and a hard-coded value of 12 in summary.data.frame - looks like (have meRcy on my soul for saying this) a hack.
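If all one wants is honest extremes, a thin wrapper may be enough. A sketch (summary_exact is my own name, not an existing function): it overwrites the rounded Min./Max. entries with the exact values from range(), though note that the default print method may still round them for display:

```r
# Hypothetical wrapper: summary() output, but with exact, unrounded extremes.
summary_exact <- function(x, ...) {
  s <- summary(x, ...)
  r <- range(x)
  s[["Min."]] <- r[1]
  s[["Max."]] <- r[2]
  s    # stored values are now exact, even if printing rounds them again
}

x <- c(-3.703236, 0.25, 3.568101)
s <- summary_exact(x)
s[["Min."]] == min(x)   # TRUE
s[["Max."]] == max(x)   # TRUE
```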

Update 3: I'm accepting DWin's answer - it led me to understand how this sausage is made. Having seen what is going on, I don't think there's a way to do what I ask without rewriting summary.

Iterator
  • See this recent discussion on the r-help mailing list on this topic: http://tolstoy.newcastle.edu.au/R/e15/help/11/10/8980.html – Brian Diggs Oct 20 '11 at 17:59
  • @BrianDiggs Thanks for the pointer. It's quite interesting, especially the issue of a hard-coded call in `summary.data.frame()`. Other than that Brian Diggs guy, the folks are nicer here on SO. :) – Iterator Oct 20 '11 at 18:10
  • I wouldn't say nicer ;-), just more patient w/ n00bs. BTW, IMHO the proper use of `summary` is indicated by its name: a way to get a quick-look at your data. When you want high-precision formats, use `table()` or `sprintf`, etc. Or if you really want high precision, take a look at `package:Rmpfr`. – Carl Witthoft Oct 20 '11 at 20:00
  • This is a useful discussion. I do think that the problem seems obvious at first sight, but the solution is not. "it's flatly wrong to denote as Min and Max values which are not Min and Max ...", if accepted as an unbreakable rule, means (I think) that we would always have to give min and max reports to their **full precision**. How would you quote the min/max of `c(-10.000001,0,10.00001)`? I agree that there is an issue of consistency, but I think it's hard to come up with a fully coherent solution. – Ben Bolker Oct 20 '11 at 20:04
  • @BenBolker Fair enough on the precision. I just don't like being misled about my data. :) If it shows 8 decimal places, but only 4 digits of precision, and refers to this as Min and Max, then I have been lied to. I am quite happy with explicable behavior, just not comfortable with defaults that are potentially misleading, especially as they can be improved, if not solved, rather easily. – Iterator Oct 20 '11 at 20:17
  • FWIW, see [these comments](http://chat.stackoverflow.com/transcript/message/1717033#1717033) in the discussion forum on explicable versus expected behavior. I am comfortable with computationally explicable behavior, even if statistically wrong, but to use words like Min/Max sets *statistical* expectations that are unfulfilled by the output. – Iterator Oct 20 '11 at 20:19
  • In thinking about this again, it occurs to me that the second column of your original apply should be reported as 11, 1.228e+05, 2.480e+05, 2.488e+05, 3.740e+05, and 4.999e+05. This notation would make it more clear/less surprising that these are rounded values. – Brian Diggs Oct 20 '11 at 21:17

1 Answer

The default for summary.data.frame is not digits=3, but rather:

    max(3, getOption("digits") - 3)  # set in the argument list
    getOption("digits")              # the default setting
    [1] 7
    options(digits = 10)
    summary(df)
       V1                    V2                 V3              
 Min.   :-3.70323584   Min.   :    11.0   Min.   :6.790622e-05  
 1st Qu.:-0.66847105   1st Qu.:122798.5   1st Qu.:2.497735e-01  
 Median : 0.00977831   Median :247971.0   Median :5.013797e-01  
 Mean   : 0.01044752   Mean   :248776.4   Mean   :5.001182e-01  
 3rd Qu.: 0.68878422   3rd Qu.:374031.0   3rd Qu.:7.502424e-01  
 Max.   : 3.56810079   Max.   :499931.0   Max.   :9.998686e-01  
IRTFM
  • Thanks for spotting the error (updated). Moreover, this answer highlights how one could address this - by setting the digits option to a different value. Yet, this introduces quite interesting behavior for `options(digits = 20)` and `apply(df, 2, summary, digits = 4)`. I am shaking my head at `summary`. – Iterator Oct 20 '11 at 19:49
  • Looking at this I would not have expected the default for summary.data.frame to come into play because you are not passing an object of class "data.frame" to summary. But I would have been wrong. And the output looks suspicious for being pathological. – IRTFM Oct 20 '11 at 20:23
  • Another great point! Who knew that sometimes matrices are passed to data frame code and sometimes they aren't? [This isn't the first time I wondered about this behavior...](http://stackoverflow.com/questions/7809570/why-is-running-unique-faster-on-a-data-frame-than-a-matrix-in-r) – Iterator Oct 20 '11 at 20:36
  • You've shown me how the sausage is made. I now think that doing what I sought requires as much effort as rewriting `summary`, and I'd rather try that (or just ignore it for now) than wrap it. – Iterator Oct 22 '11 at 04:02
  • Agreed. I think the dissatisfaction with summary is the reason for at least five different packages having as many different functions named `describe`. – IRTFM Nov 26 '14 at 14:53