Is there a way to do something like a cut()
function for binning numeric values in a dplyr
table? I'm working on a large postgres table and can currently either write a case statement in the sql at the outset, or output unaggregated data and apply cut()
. Both have pretty obvious downsides... case statements are not particularly elegant and pulling a large number of records via collect()
not at all efficient.
Asked
Active
Viewed 3.2k times
48

smci
- 32,567
- 20
- 113
- 146

Michael Williams
- 1,125
- 2
- 9
- 13
-
5Not yet, but it's on the to do list. – hadley Feb 11 '14 at 22:34
-
8I guess @hadley has something more in mind but this works: `BOD %.% group_by(cut(Time, 2)) %.% summarise(mean = mean(demand))` – G. Grothendieck Feb 11 '14 at 23:22
-
2Related: http://stackoverflow.com/questions/35204157/r-creating-a-categorical-variable-from-a-numerical-variable-and-custom-open-end – Berk U. Feb 08 '16 at 02:47
-
I think I've a answered this here in response to [Berk U](http://stackoverflow.com/users/568249/berk-u). http://stackoverflow.com/questions/35204157/r-creating-a-categorical-variable-from-a-numerical-variable-and-custom-open-end/38960967#38960967 – sean Aug 15 '16 at 18:54
-
1@hadley is this still planned? – its.me.adam Oct 26 '22 at 19:14
3 Answers
45
Just so there's an immediate answer for others arriving here via search engine, the n-breaks form of cut
is now implemented as the ntile
function in dplyr
:
> data.frame(x = c(5, 1, 3, 2, 2, 3)) %>% mutate(bin = ntile(x, 2))
x bin
1 5 2
2 1 1
3 3 2
4 2 1
5 2 1
6 3 2

Roman Cheplyaka
- 37,738
- 7
- 72
- 121

drhagen
- 8,331
- 8
- 53
- 82
-
-
@vagabond it returns `NA`. Check out the example in the `ntile` documentation page. – fmic_ May 26 '17 at 20:44
-
11Strictly, this is not a general `cut()` function for arbitrary breaks, it's only for ntiles (*"a rough rank, which breaks the input vector into ‘n’ buckets"*) – smci Aug 07 '18 at 01:40
17
I see this question was never updated with the tidyverse
solution so I'll add it for posterity.
The function to use is cut_interval
from the ggplot2
package. It works similar to base::cut
but it does a better job of marking start and end points than the base
function in my experience because cut
increases the range by 0.1% at each end.
data.frame(x = c(5, 1, 3, 2, 2, 3)) %>% mutate(bin = cut_interval(x, n = 2))
x bin
1 5 (3,5]
2 1 [1,3]
3 3 [1,3]
4 2 [1,3]
5 2 [1,3]
6 3 [1,3]
You can also specify the bin width with cut_width
.
data.frame(x = c(5, 1, 3, 2, 2, 3)) %>% mutate(bin = cut_width(x, width = 2, center = 1))
x bin
1 5 (4,6]
2 1 [0,2]
3 3 (2,4]
4 2 [0,2]
5 2 [0,2]
6 3 (2,4]

hmhensen
- 2,974
- 3
- 22
- 43
11
The following works with dplyr
, assuming x
is the variable we wish to bin:
# Make n bins
df %>% mutate( x_bins = cut( x, breaks = n )
# Or make specific bins
df %>% mutate( x_bins = cut( x, breaks = c(0,2,6,10) )

al_the_man
- 295
- 4
- 6