
I would like to plot a histogram where the buckets are specified by the following:

• 4.25 ≤ E < 4.75
• 4.75 ≤ E < 4.90
• 4.90 ≤ E ≤ 5.10
• 5.10 < E < 5.25
• 5.25 ≤ E ≤ 5.75

Notice how the equality jumps between the left and right bound. How can I do this in code?

MKhowaja

1 Answer


As far as I know, there is no cutting/breaking function in base R that lets you choose, bin by bin, which endpoint is closed. You could wrap findInterval to do some of the manipulations:

findInterval2 <- function(x, br, rightmost.closed = FALSE, left.closed = TRUE,
    trim = FALSE, labels = NULL) {
    nb <- length(br) - 1                       # number of finite bins
    left.closed <- rep_len(left.closed, nb)    # one flag per bin
    r <- findInterval(x, br, rightmost.closed)
    # A value sitting exactly on the left break of a left-open bin
    # belongs to the bin below it.
    idx <- r
    idx[is.na(idx) | idx < 1 | idx > nb] <- NA
    slideleft <- which(x == br[idx] & !left.closed[idx])
    r[slideleft] <- r[slideleft] - 1
    rng <- 0:length(br)
    if (trim) {
        # Drop everything outside the outermost breaks
        r[r < 1 | r > nb] <- NA
        rng <- 1:nb
    }
    if (is.null(labels) || isTRUE(labels)) {
        ff <- format(embed(br, 2))   # column 2 = lower break, column 1 = upper break
        # A bin is right-open when the next bin claims the shared break
        right.open <- c(left.closed[-1], !rightmost.closed)
        labels <- paste0(
            ifelse(left.closed, "[", "("),
            ff[, 2], ", ", ff[, 1],
            ifelse(right.open, ")", "]")
        )
        if (!trim) {
            labels <- c(
                paste0("(-Inf, ", ff[1, 2], ifelse(left.closed[1], ")", "]")),
                labels,
                paste0(ifelse(rightmost.closed, "(", "["), ff[nrow(ff), 1], ", Inf)")
            )
        }
    } else if (identical(labels, FALSE)) {
        labels <- NULL
    }
    if (!is.null(labels)) {
        r <- factor(r, levels = rng, labels = labels)
    }
    r
}

With a vector of breaks br <- c(4.25, 4.75, 4.90, 5.10, 5.25, 5.75), the normal behavior of findInterval assigns bin indices as follows:

  • -inf < x < 4.25: 0
  • 4.25 <= x < 4.75: 1
  • 4.75 <= x < 4.90: 2
  • 4.90 <= x < 5.10: 3
  • 5.10 <= x < 5.25: 4
  • 5.25 <= x < 5.75: 5
  • x>=5.75 : 6
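That default behaviour can be checked on values sitting exactly on the breaks. This is only a small sketch; `probe` is an illustrative vector and not part of the question's data.

```r
br <- c(4.25, 4.75, 4.90, 5.10, 5.25, 5.75)

# Illustrative values: below the range, on three breaks, above the range
probe <- c(4.00, 4.25, 5.10, 5.75, 6.00)

# Default: every bin is [lower, upper), so 5.75 falls past the last bin
default.bins <- findInterval(probe, br)
print(default.bins)  # 0 1 4 6 6

# rightmost.closed = TRUE closes only the last bin, so 5.75 lands in bin 5
closed.bins <- findInterval(probe, br, rightmost.closed = TRUE)
print(closed.bins)   # 0 1 4 5 6
```

Note that 5.10 lands in bin 4 either way, which is why some extra handling is needed for the buckets in the question.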

However, with the new left.closed parameter, we can specify whether each of the regions between consecutive breaks includes its left endpoint (the default) or not. A break that is excluded from the bin on its right falls into the bin on its left instead, making that bin right closed. This vector should have a length one less than the length of the break vector.

We could get the breaks you desire with the call below; rightmost.closed=TRUE is what keeps E = 5.75 in the last bin.

rr <- findInterval2(x, br, rightmost.closed=TRUE,
    left.closed=c(TRUE, TRUE, TRUE, FALSE, TRUE), trim=TRUE)

which should create

  • 4.25 <= E < 4.75: 1
  • 4.75 <= E < 4.90: 2
  • 4.90 <= E <= 5.10: 3
  • 5.10 < E < 5.25: 4
  • 5.25 <= E <= 5.75: 5
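One way to cross-check that table is to spell out each bucket's bounds and inclusion flags and test them directly. The sketch below is independent of findInterval2; the names `lower`, `upper`, `incl.lower`, `incl.upper`, `bin_of`, and `edge` are hypothetical helpers introduced only for this check.

```r
# Buckets from the question, written out explicitly (illustrative names)
lower      <- c(4.25, 4.75, 4.90, 5.10, 5.25)
upper      <- c(4.75, 4.90, 5.10, 5.25, 5.75)
incl.lower <- c(TRUE, TRUE, TRUE, FALSE, TRUE)    # is the lower bound included?
incl.upper <- c(FALSE, FALSE, TRUE, FALSE, TRUE)  # is the upper bound included?

# Return the index of the bucket containing a single value e, or NA
bin_of <- function(e) {
    hit <- ifelse(incl.lower, e >= lower, e > lower) &
           ifelse(incl.upper, e <= upper, e < upper)
    if (isTRUE(any(hit))) which(hit)[1] else NA_integer_
}

# Every break value, plus one value outside the range
edge <- c(4.25, 4.75, 4.90, 5.10, 5.25, 5.75, 5.80)
bins <- vapply(edge, bin_of, integer(1))
print(bins)  # 1 2 3 3 5 5 NA
```

This brute-force version is slower than a findInterval-based one, but it makes the intended treatment of each break easy to read off.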

Note that testing for exact matches with floating-point (decimal) values is unreliable, so rules that hinge on a value being exactly equal to a break are potentially flawed for continuous data.
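The classic illustration of that pitfall, as a minimal sketch (the variable names are illustrative):

```r
# 0.1 + 0.2 is not stored as exactly 0.3 in double precision
a <- 0.1 + 0.2
exact <- a == 0.3
print(exact)     # FALSE

# Comparing within a tolerance behaves as most people expect
tolerant <- isTRUE(all.equal(a, 0.3))
print(tolerant)  # TRUE

# The same can happen at a break: a value computed as 4.90 + 0.20
# is not guaranteed to equal the literal 5.10, so print it rather than assume
print(4.90 + 0.20 == 5.10)
```

If the values of E come out of arithmetic rather than being typed in or read as rounded decimals, rounding them (for example with round(x, 2)) before binning is one possible safeguard.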

Also note that this doesn't necessarily apply directly to histograms. This function can be used for binning and then creating a barplot if you would like to visualize the data. Histograms are really only for estimating the underlying density of continuous random variables, and if you are being this picky about break points, it seems like your data may be more discrete and you are interested in counts rather than densities.
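For comparison, hist() applies a single closure rule to all bins through its right argument, so it cannot mix the two conventions. A small sketch with an illustrative `probe` vector placed exactly on three of the breaks:

```r
br <- c(4.25, 4.75, 4.90, 5.10, 5.25, 5.75)
probe <- c(4.25, 5.10, 5.75)  # illustrative values on the breaks

# right = FALSE: bins are [lower, upper), so 5.10 is counted in bin 4
left.counts <- hist(probe, breaks = br, right = FALSE, plot = FALSE)$counts
print(left.counts)   # 1 0 0 1 1

# right = TRUE: bins are (lower, upper], so 5.10 is counted in bin 3
right.counts <- hist(probe, breaks = br, right = TRUE, plot = FALSE)$counts
print(right.counts)  # 1 0 1 0 1
```

In both cases the default include.lowest = TRUE keeps the outermost break inside the range, but neither setting puts 5.10 in bin 3 while also keeping 4.75 and 4.90 in the bins to their right.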

For example, we can create test data with

set.seed(15)
br <- c(4.25,4.75, 4.90, 5.10, 5.25, 5.75)
x <- runif(45, min(br), max(br))
rr <- findInterval2(x, br, rightmost.closed=TRUE,
    left.closed=c(TRUE, TRUE, TRUE, FALSE, TRUE), trim=TRUE)

barplot(table(rr))


MrFlick