3

I'm trying to cut data neatly using the Hmisc package, as in the example below:

library(Hmisc)
dummy <- data.frame(important_variable = seq_len(1000))
dummy$cuts <- cut2(dummy$important_variable, g = 4)

The produced cuts are correct with respect to the values:

  important_variable       cuts
1                  1 [  1, 251)
2                  2 [  1, 251)
3                  3 [  1, 251)
4                  4 [  1, 251)
5                  5 [  1, 251)
6                  6 [  1, 251)
> table(dummy$cuts)
[  1, 251) [251, 501) [501, 751) [751,1000] 
       250        250        250        250 

However, I would like the interval labels to be presented slightly differently. For instance, instead of

[ 1, 251 )

[ 251, 501 )

I would prefer the notation

1 - 250

251 - 500

As I do this a lot, I'm interested in a reproducible solution that is easy to apply across multiple variables.


Edit

Following the discussion in the comments, the solution would also have to work on messier variables, like x2 <- runif(100, 5.0, 7.5).

Konrad
  • @akrun it crossed my mind, but it would be a rather complex approach. I would have to decrease the last figure of the previous group, increase the value in the next group, and replace a number of characters (the easiest bit). I'm doing this a lot, and the groups differ in the way they are constructed. – Konrad Aug 02 '15 at 12:21
  • 2
    Or in one step `gsubfn('\\[\\s+(\\d+), (\\d+)\\)', ~paste0(x, '-', as.numeric(y)-1), as.character(v1))` – akrun Aug 02 '15 at 12:25
  • 1
    What would be the expected result for `x2`? – akrun Aug 02 '15 at 12:27
  • @akrun a cut with no **[ )** signs denoting the membership but having the smallest/highest value that belongs to each class in each category. – Konrad Aug 02 '15 at 12:29
  • 1
    this is more "tedious" than "complex". it's a pretty straightforward solution, even for the generic case. did you try something before posting that didn't work? – hrbrmstr Aug 02 '15 at 12:41
  • 1
    @akrun, I'm using your solution for integers, so it would be an acceptable answer. *Edit:* brilliant, thanks very much :) – Konrad Aug 02 '15 at 12:43

2 Answers

4

We could use gsubfn to strip the brackets and to subtract one from the second number, which is the exclusive upper bound. Note that the pattern also matches the closed final interval [751,1000], so its label becomes 751-999:

 library(gsubfn)
 v1 <- dummy$cuts
 v1New <-  gsubfn('\\[\\s*(\\d+),\\s*(\\d+)[^0-9]+', ~paste0(x, '-', 
                     as.numeric(y)-1), as.character(v1))
 table(v1New)
 # 1-250 251-500 501-750 751-999 
 #  250     250     250     250 

For the second case, involving decimals, we need to match the digits together with the decimal point and capture both bounds by placing them in parentheses: `([0-9.]+)` and `(\\d+\\.\\d+)`. The second capture group is converted to numeric and reduced by 0.01 (`as.numeric(y)-0.01`), which assumes the labels are printed with two decimal places. The `\\s*` denotes zero or more spaces. The spacing in the labels is uneven, so we use it instead of `\\s+`, which requires at least one space.

 v2New <- gsubfn('\\[\\s*([0-9.]+),(\\d+\\.\\d+).*', ~paste0(x,
                 '-',as.numeric(y)-0.01), as.character(v2))
 table(v2New)
 # v2New
 #5.00-5.59 5.60-6.12 6.13-6.71 6.72-7.49 
 #    25        25        25        25 

data

 set.seed(24)
 x2 <- runif(100, 5.0, 7.5)
 v2 <- cut2(x2, g=4)
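
To make the difference between `\\s+` and `\\s*` concrete, here is a minimal base-R sketch (no gsubfn) run on the four integer labels shown in the question. The variable names and the `closed_right` check are my own additions for illustration, not part of the answer above. The check leaves the upper bound of the final, closed interval untouched, so the last label reads 751-1000 rather than 751-999.

```r
# Labels as cut2() printed them for the integer example
labels <- c("[  1, 251)", "[251, 501)", "[501, 751)", "[751,1000]")

# \\s+ requires at least one space after "[" and after ","
strict <- "\\[\\s+(\\d+),\\s+(\\d+)[^0-9]+"
# \\s* also accepts labels that carry no padding
loose  <- "\\[\\s*(\\d+),\\s*(\\d+)[^0-9]+"

strict_hits <- grepl(strict, labels)  # only the padded first label matches
loose_hits  <- grepl(loose, labels)   # every label matches

# Pull out both bounds with backreferences
lower <- as.numeric(sub(loose, "\\1", labels))
upper <- as.numeric(sub(loose, "\\2", labels))

# Assumption: a trailing "]" marks a closed interval whose upper bound
# belongs to the group, so it is not decremented
closed_right <- grepl("\\]$", labels)
step <- ifelse(closed_right, 0, 1)

new_labels <- paste0(lower, "-", upper - step)
print(new_labels)
```

The same idea should carry over to the decimal case if `step` is derived from the number of decimal places in the label instead of being fixed at 1.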
akrun
  • With respect to the syntax `[0-9.]`, how can I modify it to match real numbers of the format **999.878**? I have to manage numbers with decimal places that have more than one digit before the decimal point. – Konrad Dec 10 '15 at 16:02
  • 1
    @Konrad Try `str1 <- '[ 999.878, 1001.35 )' ; gsubfn('\\[\\s*([0-9]+\\.*[0-9]*),\\s*(\\d+\\.\\d+).*', ~paste0(x, '-', as.numeric(y)- 0.01), str1) #[1] "999.878-1001.34"` – akrun Dec 10 '15 at 19:21
  • Is it possible to modify the `gsubfn` syntax so it will work on the negative values? I.e my brackets would be `[-0.6, -0.4)`. – Konrad Jan 18 '16 at 16:41
  • @Konrad Can you post that as a new question as this question is a bit old. – akrun Jan 18 '16 at 18:21
  • Thanks, I'll have a try a few more times and post if no success. – Konrad Jan 18 '16 at 18:35
3

This provides a generic solution for integer and decimal ranges (without needing to specify the increment by hand):

library(Hmisc)    # cut2()
library(stringr)

pretty_cuts <- function(cut_str) {

  # cut2() returns a factor; work on its labels as plain strings

  cut_str <- as.character(cut_str)

  sapply(seq_along(cut_str), function(i) {

    # get cut range (still as strings, so trailing zeros are visible)

    x_chr <- str_extract_all(cut_str[i], "[[:digit:]\\.]+")[[1]]

    # see if a double vs an int & get # of places if decimal so
    # we know how much to dec

    places <- str_extract(x_chr[2], "(?<=\\.)[[:digit:]]+")
    dec <- if (is.na(places)) 1 else 10^(-nchar(places))
    x <- as.numeric(x_chr)

    # "[a, b)" excludes b, so step the upper bound down; the lower bound
    # and the closed last interval "[a, b]" stay as they are

    if (str_detect(cut_str[i], "\\)$")) { x[2] <- x[2] - dec }

    sprintf("%s - %s", as.character(x[1]), as.character(x[2]))

  })

}

dummy <- data.frame(important_variable = seq_len(1000))
dummy$cuts <- cut2(dummy$important_variable, g = 4)
a <- pretty_cuts(dummy$cuts)

unique(dummy$cuts)
## [1] [  1, 251) [251, 501) [501, 751) [751,1000]
## Levels: [  1, 251) [251, 501) [501, 751) [751,1000]

unique(a)
## [1] "1 - 250"    "251 - 500"  "501 - 750"  "751 - 1000"

x2 <- runif(100, 5.0, 7.5)
b <- pretty_cuts(cut2(x2, g=4))

unique(cut2(x2, g=4))
## [1] [5.54,6.28) [6.28,6.97) [6.97,7.50] [5.02,5.54)
## Levels: [5.02,5.54) [5.54,6.28) [6.28,6.97) [6.97,7.50]

unique(b)
## [1] "5.54 - 6.27" "6.28 - 6.96" "6.97 - 7.5"  "5.02 - 5.53"
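
Since the question asks for something easy to apply across multiple variables, one option is to relabel the factor levels once per column instead of converting every observation. The sketch below uses base R only. `relabel_levels()` and `strip_brackets()` are hypothetical helper names, and `strip_brackets()` is only a stand-in that drops the brackets without adjusting the bounds. Any vectorised relabelling function, such as `pretty_cuts()`, could be passed in its place.

```r
# Hypothetical helper: rewrite the levels of a factor with any vectorised
# relabelling function, keeping the result a factor
relabel_levels <- function(f, relabel) {
  levels(f) <- relabel(levels(f))
  f
}

# Stand-in relabeller (assumption for illustration only): drop brackets
# and padding, then swap the comma for " - "; bounds are NOT adjusted
strip_brackets <- function(lbl) {
  sub(",", " - ", gsub("[][() ]", "", lbl))
}

# Two columns that already hold cut2-style labels
df <- data.frame(
  a = factor(c("[  1, 251)", "[251, 501)", "[  1, 251)")),
  b = factor(c("[5.02,5.54)", "[6.97,7.50]", "[5.02,5.54)"))
)

# Apply the same relabelling to every column in one step
df[] <- lapply(df, relabel_levels, relabel = strip_brackets)
print(df)
```

Working on `levels()` means each distinct label is processed once, and the columns stay factors for later use in `table()` or plotting.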
hrbrmstr