
I have a continuous variable that I want to split into bins, returning a numeric vector (of length equal to my original vector) whose values relate to the values of the bins. Each bin should have roughly the same number of elements.

The question "splitting a continuous variable into equal sized groups" describes a number of techniques for related situations. For instance, if I start with

x = c(1,5,3,12,5,6,7)

I can use cut() to get:

cut(x, 3, labels = FALSE)
[1] 1 2 1 3 2 2 2

This is undesirable because the returned values are just sequential integers; they have no direct relation to the underlying original values in my vector.

Another possibility is cut2, for instance:

library(Hmisc)
cut2(x, g = 3, levels.mean = TRUE)
[1] 3.5 3.5 3.5 9.5 3.5 6.0 9.5

This is better because now the return values relate to the values of the bins. It is still less than ideal, though, since:

  • (a) it yields a factor, which then needs to be converted to numeric, which is both slow and awkward code-wise.
  • (b) Ideally I'd like to be able to choose whether to use the top or bottom end points of the intervals, instead of just the means.

I know that there are also options using regex on the factors returned by cut or cut2 to get the top or bottom points of the intervals. These too seem overly cumbersome.

Is this just a situation that requires some not-so-elegant hacking? Or, is there some easier functionality to accomplish this?

My current best effort is as follows:

MyDiscretize = function(x, N_Bins){
    f = cut2(x, g = N_Bins, levels.mean = TRUE)
    return(as.numeric(levels(f))[f])
}

My goal is to find something faster, more elegant, and easily adaptable to use either of the endpoints, rather than just the means.


Edit:

To clarify: my desired output would be:

  • (a) an equivalent to what I can achieve right now in the example with cut2 but without needing to convert the factor to numeric.

  • (b) if possible, the ability to also easily choose to use either of the endpoints of the interval, instead of the midpoint.

Michael Ohlrogge

2 Answers


Maybe not very elegant, but it should be efficient. Try this function:

myCut <- function(x, breaks, retValues = c("means", "highs", "lows")) {
    retValues <- match.arg(retValues)
    if (length(breaks) != 1) stop("breaks must be a single number")
    breaks <- as.integer(breaks)
    if (is.na(breaks) || breaks < 2) stop("breaks must be greater than or equal to 2")
    # equally spaced bin edges spanning the range of x
    intervals <- seq(min(x), max(x), length.out = breaks + 1)
    # index of the bin each element falls into; all.inside keeps max(x) in the last bin
    bins <- findInterval(x, intervals, all.inside = TRUE)
    if (retValues == "means") return(rowMeans(cbind(intervals[-(breaks + 1)], intervals[-1]))[bins])
    if (retValues == "highs") return(intervals[-1][bins])
    intervals[-(breaks + 1)][bins]
}
x = c(1,5,3,12,5,6,7)
myCut(x,3)
#[1]  2.833333  6.500000  2.833333 10.166667  6.500000  6.500000  6.500000
myCut(x,3,"highs")
#[1]  4.666667  8.333333  4.666667 12.000000  8.333333  8.333333  8.333333
myCut(x,3,"lows")
#[1] 1.000000 4.666667 1.000000 8.333333 4.666667 4.666667 4.666667
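Note that myCut places its breaks on an equally spaced grid, so the bins have equal width rather than equal counts. Since the question asks for bins with roughly the same number of elements, one possible variant takes the breaks from quantile() instead. This is only a sketch: the name myQuantileCut and its argument nBins are made up for illustration, "means" is the midpoint of each interval (as in myCut), and with heavily tied data unique() may leave fewer bins than requested.

```r
# Hypothetical variant of myCut(): breaks come from quantiles instead of an
# equally spaced grid, so bins hold roughly equal numbers of points.
myQuantileCut <- function(x, nBins, retValues = c("means", "highs", "lows")) {
  retValues <- match.arg(retValues)
  # unique() guards against duplicated breaks when x has many ties
  breaks <- unique(quantile(x, probs = seq(0, 1, length.out = nBins + 1),
                            names = FALSE))
  # left-closed intervals; all.inside keeps max(x) in the last bin
  bins <- findInterval(x, breaks, all.inside = TRUE)
  lows <- breaks[-length(breaks)]
  highs <- breaks[-1]
  switch(retValues,
         means = ((lows + highs) / 2)[bins],
         highs = highs[bins],
         lows  = lows[bins])
}

x <- c(1, 5, 3, 12, 5, 6, 7)
myQuantileCut(x, 3, "lows")
myQuantileCut(x, 3, "highs")
```

The result is a plain numeric vector of the same length as x, so no factor conversion is needed.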
nicola

Use ave like this:

Given:

x = c(1,5,3,12,5,6,7)

Mean:

ave(x,cut2(x,g = 3), FUN = mean)
[1] 3.5 3.5 3.5 9.5 3.5 6.0 9.5

Min:

ave(x,cut2(x,g = 3), FUN = min)
[1] 1 1 1 7 1 6 7

Max:

ave(x,cut2(x,g = 3), FUN = max)
[1]  5  5  5 12  5  6 12

Or standard deviation:

ave(x,cut2(x,g = 3), FUN = sd)
[1] 1.914854 1.914854 1.914854 3.535534 1.914854       NA 3.535534

Note the NA result for the interval that contains only one data point.

Hope this is what you need.

NOTE:
The parameter g in cut2 is the number of quantile groups. The groups might not contain the same number of data points, and the intervals might not have the same length.
On the other hand, cut splits the range of x into intervals of equal length.
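The difference between equal-width and quantile-based bins can be checked by counting the elements per bin. The sketch below stays in base R and uses quantile() breaks as a stand-in for quantile grouping. That is an assumption, since cut2 may choose its breaks by a different rule. The name qBreaks is made up for illustration.

```r
x <- c(1, 5, 3, 12, 5, 6, 7)

# Equal-width bins: cut() divides the range of x into 3 intervals of equal length
widthCounts <- table(cut(x, 3))
widthCounts

# Quantile-based breaks (a stand-in, not necessarily what cut2 does internally);
# right = FALSE gives left-closed intervals, include.lowest keeps max(x) in the last one
qBreaks <- quantile(x, probs = seq(0, 1, length.out = 4), names = FALSE)
quantCounts <- table(cut(x, qBreaks, include.lowest = TRUE, right = FALSE))
quantCounts
```

Comparing the two tables shows how unevenly equal-width bins can fill up when the data are skewed.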

R. Schifini
  • Ok, thank you, this is helpful, both the function and the note. I may have been sloppy in my terminology, - `cut2` is a way to get bins with *relatively* equal numbers of elements, right? – Michael Ohlrogge Sep 19 '16 at 17:04
  • Not really, if you look at the result of `cut2` the first group contains four elements, the second only one and the last two. Function `cut` does not guarantee that each group will have the same amount of elements. – R. Schifini Sep 19 '16 at 17:32
  • If you want to have the same quantity of elements, you should order them and then separate them into equally sized groups. – R. Schifini Sep 19 '16 at 17:34
  • Looking at the docs for `cut_number` from ggplot2, it looks like it *tries* for approximate equality, but doesn't guarantee it. `cut2` isn't explicit one way or the other, but it may perhaps be much the same. – Michael Ohlrogge Sep 19 '16 at 17:47
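The suggestion in the comments to order the values and then split them into equally sized groups could be sketched as follows, combined with ave from the answer above. The names g and grp are made up for illustration. One caveat of this approach is that tied values can end up in different groups, because ties are broken by position.

```r
x <- c(1, 5, 3, 12, 5, 6, 7)
g <- 3  # desired number of groups (illustrative)

# Rank the values (ties broken by position) and map ranks 1..n onto groups 1..g
grp <- ceiling(rank(x, ties.method = "first") * g / length(x))

table(grp)                # group sizes differ by at most one
ave(x, grp, FUN = min)    # lower end of each group, as a numeric vector
ave(x, grp, FUN = max)    # upper end of each group
ave(x, grp, FUN = mean)   # group means
```

Because ave returns a numeric vector aligned with x, switching between the lower end, upper end, and mean of each group only requires changing FUN.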