Why does `hist(..., nclass=nclass.scott)` fail in R?

Question

I had reported this to R-core, but they said (without explaining) that this is not a bug in R:

During automatic processing of some data, I came across an empty data set (or similar). Anyway, the hist() function used threw an error which looks like a syntax error to me (I'm an R beginner):

> df <- data.frame(n=c(0))
> str(df)
'data.frame':    1 obs. of  1 variable:
$ n: num 0
> hist(df$n) ### this one works!
> hist(df$n, nclass=nclass.scott)  ### this does not!
Error in if (h > 0) ceiling(diff(range(x))/h) else 1L :
 missing value where TRUE/FALSE needed
> df <- data.frame(n=c(0,1))
> hist(df$n, nclass=nclass.scott) ### this one works

Versions tested: 3.3.1 (linux) and 3.3.3 (Windows)

Without nclass=nclass.scott I don't get an error. I failed to find documentation for this parameter, however; I just found that histograms with this parameter look more appealing to me. With Google I found: "nclass.scott uses Scott's choice for a normal distribution based on the estimate of the standard error, unless that is zero where it returns 1"

I'm also expecting some robustness: In automatic processing you never know how much data a particular set will have, and I would prefer a histogram with a single bar in that case. Also compare these:

> hist(numeric(0))
Error in hist.default(numeric(0)) : invalid number of 'breaks'
> hist(numeric(1))
> hist(numeric(1), nclass=nclass.scott)
Error in if (h > 0) ceiling(diff(range(x))/h) else 1L : missing value where TRUE/FALSE needed
> hist(numeric(0), nclass=nclass.scott)
Error in if (h > 0) ceiling(diff(range(x))/h) else 1L : missing value where TRUE/FALSE needed

The function nclass.scott() should return something different when length(x) == 1, but I don't see much point in making a histogram for such small sample sizes. — Edgar Santos, May 18 '17 at 08:10
Yes, this is not a bug. `help("nclass.scott")` does not claim that it works if the standard error is not defined. You should also be using the `breaks` parameter of `hist`. If this corner case is important to you, you can do `hist(df$n, breaks= if (length(df$n) == 1L) 1L else nclass.scott)`. — Roland, May 18 '17 at 08:40
@ed_sans: The mistake is to guess how a function will be used: I was visualizing the results of some automatic tests, where I had two subsets: One with at least partially successful tests, and the other with completely failed tests. As it turned out, the second subset was empty. — U. Windl, May 18 '17 at 09:24
@roland: Shouldn't `== 1L` be `<= 1L` for completeness? Also, what's the meaning of `L` in `1L`? — U. Windl, May 18 '17 at 09:31
The `L` forces the number to be an integer - http://stackoverflow.com/questions/24350733/why-would-r-use-the-l-suffix-to-denote-an-integer — Richard Telford, May 18 '17 at 10:03
@U.Windl If your code passes `NULL` or `numeric(0)` to `hist` you have bigger problems. I would want an error in such a case and `hist` will give you one anyway. — Roland, May 18 '17 at 10:32
Also, keep in mind, that histograms are a tool for data exploration which is interactive by definition. — Roland, May 18 '17 at 10:33
@Roland: If you automatically generate several dozen different plots from one data file, you want some robustness. Think of generating automated reports with a lot of graphics, and not of a single mathematician exploring data. — U. Windl, May 18 '17 at 11:49
I understand that. But it's unreasonable to expect R functions developed primarily for interactive use to have that robustness. The average R user wants an error for such corner cases. If you need robustness, you are expected to implement it via `if` conditions and error handling (see `tryCatch`). PS: A histogram for one observation is pretty useless. — Roland, May 18 '17 at 12:11
@Roland: A histogram with one value says: "100% of the samples have that value"; what's wrong with that? — U. Windl, Mar 11 '19 at 20:55
@U.Windl I didn't say "wrong". I said "useless". If you have one value, show that value. A histogram only adds obfuscation, because instead of showing the exact value, you show a range. — Roland, Mar 12 '19 at 06:58
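The `tryCatch` route mentioned in the comments above could look roughly like the following. This is only a sketch: `hist_or_fallback` is a hypothetical helper name, and falling back to `hist()`'s default breaks (and to `NULL` for empty input) is an assumption about what an automated report would want, not something the thread settles.

```r
# Hypothetical helper: try Scott's rule first; if that fails, fall back to
# hist()'s default breaks; if even that fails (e.g. empty input), return NULL.
hist_or_fallback <- function(x, ...) {
  tryCatch(
    hist(x, breaks = nclass.scott, ...),
    error = function(e) {
      tryCatch(hist(x, ...), error = function(e2) NULL)
    }
  )
}

# plot = FALSE only computes the histogram object, no graphics device needed
h_one   <- hist_or_fallback(0, plot = FALSE)
h_two   <- hist_or_fallback(c(0, 1), plot = FALSE)
h_empty <- hist_or_fallback(numeric(0), plot = FALSE)
```

Swallowing every error this way also hides genuine problems (for example, non-numeric input), so a production version would probably want to log the condition message before falling back.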

score 0 · Answer 1 · answered May 18 '17 at 08:10

The standard deviation cannot be estimated from a single observation, so `sd()` returns NA in this case. `nclass.scott` then ends up evaluating `if (NA > 0)`, which explains the error message about the missing value.

> sd(0)
[1] NA

> sd(c(1,1))
[1] 0
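To make the chain from `NA` to the error explicit, here is a small sketch that recomputes the bin width `h` by hand. The formula is the one I understand `nclass.scott` to use in the R sources (the error message only shows the `if (h > 0)` part), so treat the exact expression as an assumption.

```r
x <- 0

# Bin width as (assumed to be) computed inside nclass.scott
h <- 3.5 * sqrt(var(x)) * length(x)^(-1/3)

is.na(h)      # var() of a single value is NA, so h is NA
is.na(h > 0)  # comparing NA with 0 gives NA, not TRUE or FALSE

# if() cannot branch on NA, which is the error seen in the question
failed <- inherits(
  tryCatch(if (h > 0) "positive" else "zero", error = function(e) e),
  "error"
)

# With two identical values the variance is 0, so the else branch applies
n_const <- nclass.scott(c(1, 1))
```

So the error is not a syntax problem: it is the ordinary runtime error R raises whenever the condition of an `if` evaluates to `NA`.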

theSZ

I see, but can't the functions be more robust? In automatic processing you never know how much data a particular set will have, and I would prefer an empty histogram in that case. – U. Windl May 18 '17 at 09:43
See Roland's comment for a robust solution. – theSZ May 18 '17 at 11:23
Roland's solution does not handle the case when the data set is empty. – U. Windl May 18 '17 at 11:46
In your solution, `if (length(df$n) > 1L)` already handles this for you, but you might want to consider `if (length(df$n) > 0L)` instead, so that `breaks=if (length(df$n) == 1L) 1L else nclass.scott` is still meaningful? – theSZ May 18 '17 at 12:39

score -1 · Answer 2 · answered May 18 '17 at 11:54

It seems the best solution (as things are now) is (combining Roland's with what I had):

if (length(df$n) > 1L) {
    hist(df$n, breaks=if (length(df$n) == 1L) 1L else nclass.scott)
} # else produce nothing
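Note that with the outer test `> 1L`, the inner `length(df$n) == 1L` branch can never be reached, so a single observation produces no plot at all. If a one-bar histogram is wanted for that case, a variant along these lines might work. It is a sketch only: `safe_hist` is a hypothetical name, and dropping non-finite values before counting is my assumption (it mirrors what `hist()` does internally before computing breaks).

```r
# Hypothetical wrapper: no plot for empty data, one bin for a single value,
# Scott's rule otherwise.
safe_hist <- function(x, ...) {
  x <- x[is.finite(x)]              # count only values hist() would use
  if (length(x) == 0L) {
    return(invisible(NULL))         # nothing to draw
  }
  hist(x, breaks = if (length(x) == 1L) 1L else nclass.scott, ...)
}

# plot = FALSE returns the histogram object without opening a device
h_empty <- safe_hist(numeric(0), plot = FALSE)
h_one   <- safe_hist(0, plot = FALSE)
h_na    <- safe_hist(c(0, NA), plot = FALSE)
h_two   <- safe_hist(c(0, 1), plot = FALSE)
```

In a report script this would be called as `safe_hist(df$n)`; the caller can test the result with `is.null()` to decide whether to print a "no data" note instead.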

U. Windl
