12

I've read this question here: Convert continuous numeric values to discrete categories defined by intervals

However, I would like to output a numeric value (rather than a factor), specifically the lower and/or upper bound of each interval, in separate columns.

In essence, the following does what I want, except that 'df$start' and 'df$end' are returned as factors:

df$start <- cut(df$x, 
                breaks = c(0,25,75,125,175,225,299),
                labels = c(0,25,75,125,175,225),
                right = TRUE)

df$end <- cut(df$x, 
              breaks = c(0,25,75,125,175,225,299),
              labels = c(25,75,125,175,225,299),
              right = TRUE)

The use of as.numeric() returns the level of the factor (i.e. values 1-6) rather than the original numbers.

Henrik
Andrew
  • 4
    You could cast using `as.character` first, and then `as.numeric`. I feel like there should be a better solution to this problem, though. – user295691 Sep 02 '15 at 14:45
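A minimal sketch of the conversion suggested in the comment above, using three made-up values of x. It shows that `as.numeric()` on the factor returns the level codes, while going through `as.character()` (or indexing the converted levels) recovers the label values.

```r
breaks <- c(0, 25, 75, 125, 175, 225, 299)

# Illustrative input; the labels are the lower bounds, as in the question
f <- cut(c(10, 60, 250), breaks = breaks, labels = breaks[-length(breaks)])

codes      <- as.numeric(f)                # level codes: 1 2 6
via_char   <- as.numeric(as.character(f))  # label values: 0 25 225
via_levels <- as.numeric(levels(f))[f]     # same values, converting each level once
```

The `levels(f)` form converts each distinct level only once, which can matter for long vectors.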

4 Answers

14

Much of the behavior of cut is related to creating the labels that you're not interested in. You're probably better off using findInterval or .bincode.

You would start with the data

set.seed(17)
df <- data.frame(x=300 * runif(100))

Then set the breaks and find the intervals:

breaks <- c(0,25,75,125,175,225,299)
df$interval <- findInterval(df$x, breaks)
df$start <- breaks[df$interval]
df$end <- breaks[df$interval + 1]
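Note that `findInterval` by default uses intervals closed on the left, whereas the `cut(..., right = TRUE)` call in the question closes them on the right. If the question's convention matters, a sketch along the following lines with `.bincode` should reproduce it; the input values are illustrative, and values outside the breaks simply come back as NA.

```r
breaks <- c(0, 25, 75, 125, 175, 225, 299)
x <- c(10, 25, 60, 299, 300)  # illustrative values, including boundary cases

# .bincode returns the integer bin index, with (a, b] intervals when right = TRUE
idx <- .bincode(x, breaks, right = TRUE)

# Indexing with NA yields NA, so out-of-range values propagate cleanly
result <- data.frame(x = x, start = breaks[idx], end = breaks[idx + 1])
```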
user295691
  • 1
    +1 for `.bincode`. I had only seen `findInterval` before. Now trying to figure out what the key differences are between them. It looks like there is a difference in handling values at the breaks, they are moved up to the next level in `findInterval` but not `.bincode`. And points outside the breaks map to NA in `.bincode`, and 0 or N in `findInterval` under default arguments – C8H10N4O2 Sep 02 '15 at 18:48
  • @C8H10N4O2 this question had made me curious enough to go down to the source level; it really seems like these functions are practically identical in their algorithm and an interesting project would be to consolidate the implementation into a single function with options to support the required behaviors from both, probably by transforming the output of a single function. – user295691 Sep 02 '15 at 19:03
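The differences discussed in these comments can be checked directly. The sketch below runs both functions on a few hand-picked boundary values (an illustrative choice, not taken from the answer) and also shows `findInterval`'s `left.open` argument, which is available in recent versions of R and makes the intervals right-closed.

```r
breaks <- c(0, 25, 75, 125, 175, 225, 299)
x <- c(-1, 0, 25, 100, 299, 300)  # below range, on breaks, inside, above range

fi_default  <- findInterval(x, breaks)                    # [a, b) intervals
fi_leftopen <- findInterval(x, breaks, left.open = TRUE)  # (a, b] intervals
bc_default  <- .bincode(x, breaks)                        # (a, b], NA outside

comparison <- data.frame(x, fi_default, fi_leftopen, bc_default)
```

Because `findInterval` returns 0 for values below the first break, and indexing with 0 silently drops that element, `breaks[findInterval(x, breaks)]` can come back shorter than `x` when such values are present.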
8

I'm guessing at what you want, since if you wanted the "original numbers", you could just use df$x. I presume you are after a number that reflects the group? If so, what about the following.

## Generate some example data
x = runif(5, 0, 300)
## Specify the labels
labels = c(0,25,75,125,175,225)
## Use cut as before
y = cut(x, 
    breaks = c(0,25,75,125,175,225,300),
    labels = labels,
    right = TRUE)

When we convert y to a numeric, this gives the index of the label. Hence,

labels[as.numeric(y)]

or simpler

labels[y]
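As a variation on the same idea, `cut` can return the integer codes directly when called with `labels = FALSE`, which avoids the factor altogether. A minimal sketch with made-up values of x, keeping the breaks in a variable so that both bounds can be looked up:

```r
breaks <- c(0, 25, 75, 125, 175, 225, 300)
x <- c(10, 60, 100, 150, 200, 250)  # illustrative values, one per interval

# labels = FALSE makes cut return integer bin indices instead of a factor
idx <- cut(x, breaks = breaks, labels = FALSE, right = TRUE)

start <- breaks[idx]      # lower bound of each interval
end   <- breaks[idx + 1]  # upper bound of each interval
```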
alko989
csgillespie
  • 3
    In fact, probably better to save the breaks, and not use the labels at all -- if factor levels is all we need, it doesn't matter whether we use the autogenerated labels. So just `df$start <- breaks[cut(df$x, breaks=breaks, right=TRUE)]` – user295691 Sep 02 '15 at 14:50
  • Thanks both. Both the answer and the comment solve the problem @user295691 – Andrew Sep 02 '15 at 15:25
4

I would use a regex, since all the information is already in the labels that cut produces.

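cut_borders <- function(x){
  # Matches interval labels such as "(0,25]" or "[-1.5,2]"
  pattern <- "(\\(|\\[)(-*[0-9]+\\.*[0-9]*),(-*[0-9]+\\.*[0-9]*)(\\)|\\])"

  # Group 2 is the lower bound, group 3 the upper bound
  start <- as.numeric(gsub(pattern,"\\2", x))
  end <- as.numeric(gsub(pattern,"\\3", x))

  data.frame(start, end)
}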

The pattern in words:

  • Group 1: either a ( or a [, so we use (\\(|\\[).

  • Group 2: the number might be negative, so we allow a minus sign (-*); we then look for at least one digit ([0-9]+), which can be followed by a decimal point (\\.*) and the digits after it ([0-9]*).

  • Next there is a comma (,)

  • Group 3: same as group 2.

  • Group 4: analogous to group 1, we expect either a ) or a ].

Here is a random variable cut at its quartiles. The function cut_borders returns the bounds we are looking for:

x <- rnorm(10)

x_groups <- cut(x, quantile(x, 0:4/4), include.lowest= TRUE)

cut_borders(x_groups)
  • that's nice, but the regex can be massively shortened. We know the cut pattern well. Thus we can for example use `".{1}(-?\\d+),(-?\\d+).{1}"` - no need to look for either [ or (, and the "-" can be made optional with ?. using \\d makes it slightly shorter than [0-9] – tjebo Apr 25 '21 at 23:23
  • This regex doesn't capture intervals with scientific notation, like `(1.26e+03,1.55e+03]`. – postylem Jul 19 '22 at 00:49
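One way to address the scientific-notation case raised in the last comment is to allow an optional exponent in the number pattern. The function below is only a sketch: the name `cut_borders_sci` and the test labels are made up for illustration, and it assumes finite bounds (labels containing `Inf` are not handled).

```r
cut_borders_sci <- function(x) {
  # A number: optional sign, digits with optional decimal part, optional exponent
  num <- "[-+]?[0-9]*\\.?[0-9]+(?:[eE][-+]?[0-9]+)?"
  # Opening bracket, lower bound, comma, upper bound, closing bracket
  pattern <- paste0("^[\\(\\[](", num, "),(", num, ")[\\)\\]]$")

  labs <- as.character(x)
  # perl = TRUE is needed for the non-capturing group (?: ... )
  data.frame(
    start = as.numeric(sub(pattern, "\\1", labs, perl = TRUE)),
    end   = as.numeric(sub(pattern, "\\2", labs, perl = TRUE))
  )
}

borders <- cut_borders_sci(c("(1.26e+03,1.55e+03]", "[-0.5,0.25]", "(0,25]"))
```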
2

We can make use of tidyr::extract

library(tidyverse)
set.seed(17)
df <- data.frame(x = cut(300 * runif(100), c(0,25,75,125,175,225,299)))

df %>%
  extract(x, c("start", "end"), "(-?\\d+),(-?\\d+)")
#>     start end
#> 1      25  75
#> 2     225 299
#> 3     125 175
#> 4     225 299
#> 5      75 125
#> 6     125 175
#> ...

Created on 2021-05-11 by the reprex package (v2.0.0)
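Note that `tidyr::extract` returns character columns unless its `convert` argument is used. If a dependency-free route is preferred, base R's `utils::strcapture` does something similar and takes the column types from a prototype data frame. This is a swapped-in alternative rather than part of the answer above, sketched here on three illustrative values:

```r
breaks <- c(0, 25, 75, 125, 175, 225, 299)
labs <- as.character(cut(c(10, 60, 250), breaks))  # "(0,25]" "(25,75]" "(225,299]"

# The prototype fixes the names and types of the output columns
bounds <- utils::strcapture(
  "(-?\\d+),(-?\\d+)",
  labs,
  proto = data.frame(start = numeric(), end = numeric()),
  perl = TRUE
)
```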

P.S. Thanks to user295691 for the data and to user machine for the first draft of the regex, which is modified here. Both +1 :)

tjebo