I would like to write a function that creates a binning variable from raw data. Specifically, I have a dataset with the age of each respondent, and I would like to write a function that classifies each person into an age group, where the age groups are a parameter of that function.

This is what I started with:

data <- data.frame(age = 18:100)

foo <- function(data, brackets = list(18:24, 25:34, 35:59)) {
  require(tidyverse)
  tmp <- data %>%
    drop_na(age) %>%
    mutate(age_bracket = case_when(age %in% brackets[[1]] ~ paste(brackets[[1]][1], "to", brackets[[1]][length(brackets[[1]])]),
                                   age %in% brackets[[2]] ~ paste(brackets[[2]][1], "to", brackets[[2]][length(brackets[[2]])]),
                                   age %in% brackets[[3]] ~ paste(brackets[[3]][1], "to", brackets[[3]][length(brackets[[3]])])))
  print(tmp)
}

The case_when part is clearly inflexible, since I have to specify the number of brackets ahead of time. It is also quite lengthy. I would like to write some sort of loop that looks at the number of elements in the brackets argument and creates the brackets accordingly. So if I wanted to add a 60:Inf age group, the function should add another age group.

After searching online, I found that some people use defused expressions (e.g. quos) for this. I am quite unfamiliar with those, so I struggle to use them for my purpose.

Tea Tree
  • Would it be better to use `cut` and control its `labels=`? – r2evans Mar 27 '20 at 22:40
  • BTW, you might not want to try to do `age %in% 60:Inf` ... – r2evans Mar 27 '20 at 22:52
  • Though the labels themselves can be tweaked a little, I suspect that `cut(data$age, c(0, 18, 25, 35, 60, Inf), include.lowest = TRUE)` gives you much of what this appears to be doing. – r2evans Mar 27 '20 at 22:54
  • BTW: your use of `require(tidyverse)` is ill-advised. If all of `tidyverse` is not available, the function continues unabated (since `require` does not `stop` on error), but it may fail on the next line if either `dplyr` or `tidyr` failed to load. Recommendations: (1) only require the packages you need, `dplyr` and `tidyr` here; and one of (2a) use `library`, so that if the load fails there will be a meaningful error message, or (2b) use `if (!require(...)) { fail_action }` so that you can actually *use* what `require` is intended for. Ref: https://stackoverflow.com/a/51263513/3358272 – r2evans Mar 27 '20 at 22:56
  • Cut works great, with one small caveat: I don't like the standard labels that cut produces. How can I prettify them, e.g. "18 to 25" instead of "(18, 25]"? And the last one should read "60 and above". – Tea Tree Mar 27 '20 at 23:19
  • As I said in my first comment, `cut(..., labels=)` will work. Something like `v <- c(0,18,25,35,60,Inf); paste(v[-length(v)], v[-1], sep = " to ")` – r2evans Mar 27 '20 at 23:29
  • How can I control the overlap? Using this code will give me 18-24, 24-34... when it should produce 18-24, 25-34... – Tea Tree Mar 27 '20 at 23:36
  • `paste(v[-length(v)], v[-1]-1, sep = " to ")` – r2evans Mar 28 '20 at 00:41

1 Answer

I think you are looking for the cut function. The following does the job:

data <- data.frame(age = 18:100)

data$age_bracket <- cut(data$age, breaks = c(0, 18, 25, 35, 60, Inf))

unique(data$age_bracket)
# [1] (0,18]   (18,25]  (25,35]  (35,60]  (60,Inf]
# Levels: (0,18] (18,25] (25,35] (35,60] (60,Inf]

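You can also define your own labels via the labels argument if you don't like the default bracket labels. The advantage of using cut rather than a hand-coded solution is that the output is a factor whose levels follow the order of the breaks, so the usual operations (e.g. ordering) work on it directly.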
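To illustrate the custom labels, here is a minimal sketch of how the whole thing could be wrapped in a function. It assumes integer ages and brackets given by their lower bounds; the helper name `add_age_bracket` and its `lower` argument are hypothetical names I made up for the example, not something from the question.

```r
# Hypothetical helper: `add_age_bracket` and `lower` are illustrative names.
# Assumes integer ages; `lower` holds the lower bound of each bracket.
add_age_bracket <- function(data, lower = c(18, 25, 35, 60)) {
  # Append Inf so that the last bracket is open-ended.
  breaks <- c(lower, Inf)
  # Inclusive upper bound of each bracket (Inf - 1 is still Inf).
  upper <- breaks[-1] - 1
  # "18 to 24" style labels, with "60 and above" for the open-ended bracket.
  labels <- ifelse(is.finite(upper),
                   paste(lower, "to", upper),
                   paste(lower, "and above"))
  # right = FALSE gives intervals [lower, next lower), so 25 lands in "25 to 34".
  data$age_bracket <- cut(data$age, breaks = breaks,
                          labels = labels, right = FALSE)
  data
}

data <- data.frame(age = 18:100)
result <- add_age_bracket(data)

levels(result$age_bracket)
# [1] "18 to 24"     "25 to 34"     "35 to 59"     "60 and above"
```

Adding or removing a bracket then only means changing `lower`, e.g. `add_age_bracket(data, lower = c(18, 30, 50))`. Ages below the first lower bound fall outside every interval and come back as NA.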

linog