What is an efficient way to calculate a category-specific metric for a variety of levels of variable granularity?

Question

To make my abstract question more tangible, I want to calculate stratum-specific incidence rates of a particular disease for each age stratum. But I want to be able to write flexible code that can adapt to a variety of user-specified levels of granularity in age categorization. For instance, I might be interested in calculating incidence among children <2 mos of age, 2-3 mos, 4-5 mos, 6-11 mos, 12-23 mos, and 24-59 mos (most granular treatment of age). But I also want to perform the same calculation for a coarser treatment of age (i.e., <12 mos of age, 12-23 mos, 24-59 mos). So I want to be able to change the cutpoints at will.

Example:

    incidence <- (age_group_freq / census_pop_for_age_group) * 100000

I am thinking that I could set a control structure function (i.e., if age < 12 mo then agegroup=1, etc.) as an object and reference that object in an *apply function (given that I want to calculate these values across a number of data frames), but I'm wondering if there is a better way to solve this problem. I'm happy for you to refer me to papers or give a specific answer here. Thank you.

I guess you're looking for `?cut`; `cut(1:10, c(-Inf, 5, 8, Inf))` — alexis_laz, Aug 12 '14 at 16:43
At the moment I'm thinking your question is too general, but there are quite a few examples on SO using cut(). My preference is cut2 in the 'Hmisc' package because its defaults match my preference for how the cuts are made. It also has a compact implementation of a grouping function based on `quantile`. — IRTFM, Aug 12 '14 at 16:48
I'm constantly amazed at the little R functions I learn about that make things so much easier! This is great. Thanks @alexis_laz. For clarification, I guess I would assign this function to an object and then run the calculation using the object. Then when I wanted to change the cutpoints, I'd just change the object and then re-run the code? — mcjudd, Aug 12 '14 at 16:50
Could you add an example in your question to show the exact manipulation you want to do per age-stratum? If you search for `cut` or `findInterval` combined with `aggregate` or `tapply` or `split` - `lapply` you'll find plenty of material to guide you through your work. From what I understand, perhaps you're looking for something like `x = sample(1:80, 1e3, T); table(cut(x, c(-Inf, 5, 20, 50, Inf))); table(cut(x, c(-Inf, 2, 5, 40, 50, Inf)))`? — alexis_laz, Aug 12 '14 at 17:08

score 1 · Answer 1 · answered Aug 12 '14 at 17:16

The following may be helpful. Consider the ages of newly diagnosed cases. You can 'cut' them at different breaks:

age = sample(1:60, 100, replace=TRUE)
age
  [1] 57 50 52 18 18 15 48 36  5 45 25 44 23 60 36 27 43 41 23 10 41 40 58  5 55 29 21 41 16 15 40 55 52 15 53  3 13 57 37 49 33
 [42] 34 54 25 28  5 23 43 50 12  9 42 40 25 29 51 39 59  3 19 11 17 35  4 41 45 28 14  5 36 13 56 33  7 55  5 11 34 47 46 44 26
 [83] 56 55 13 59 57 60 37 51 47 40 39 28 33  4 28 43 20 24

table(cut(age, breaks=c(0,1,10,25,50,60)))

  (0,1]  (1,10] (10,25] (25,50] (50,60] 
      0      12      24      44      20 

barplot(table(cut(age, breaks=c(0,1,10,25,50,60))))

[Figure: bar plot of the case counts per age interval from the table above]
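
To connect the counts above to the incidence formula in the question, here is a minimal sketch. It assumes you have the ages (in whole months) of the cases and of the census population as two plain vectors; `incidence_by_age`, `case_age`, and `pop_age` are hypothetical names, and the data are simulated purely for illustration.

```r
# Hypothetical helper: incidence per 100,000 for arbitrary age cutpoints.
# case_age: ages in months of diagnosed cases (assumed input)
# pop_age:  ages in months of everyone in the census population (assumed input)
incidence_by_age <- function(case_age, pop_age, breaks) {
  # right = FALSE gives left-closed intervals [a, b), so for whole-month ages
  # breaks = c(0, 12, 24, 60) means <12, 12-23, and 24-59 months
  cases <- table(cut(case_age, breaks, right = FALSE))
  pop   <- table(cut(pop_age,  breaks, right = FALSE))
  cases / pop * 100000
}

set.seed(42)                                    # simulated data, illustration only
pop_age  <- sample(0:59, 1e5, replace = TRUE)   # census population
case_age <- sample(pop_age, 500)                # cases drawn from that population

# Same calculation, two levels of granularity: only the cutpoints change
coarse <- incidence_by_age(case_age, pop_age, c(0, 12, 24, 60))
fine   <- incidence_by_age(case_age, pop_age, c(0, 2, 4, 6, 12, 24, 60))
```

Using `right = FALSE` is a design choice here: with the default `right = TRUE`, a child aged exactly 12 months would fall in the `(0,12]` group rather than the 12-23 month group.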

score 1 · Accepted Answer · edited May 23 '17 at 12:11

Your question would have been better if you had provided your data (or a representative subset) and showed what you had tried so far. See this post. Absent that, this makes a lot of assumptions.

Assume a data frame, `children`, with one row per child: `children$age` is the child's age in months, and each column `Xn` is 1 if the child has disease Xn and 0 otherwise. In this example, 1,000,000 children are randomly "assigned" each of 5 diseases at a 5% rate. Then,

set.seed(1)    # for reproducible example
children <- data.frame(age=sample(1:60,1e6, replace=TRUE), 
                 matrix(sample(0:1,5e6, replace=TRUE, prob=c(0.95,0.05)),ncol=5))

# you start here...
disease.rates <- function(data,breaks) {
  cuts      <- cut(data$age,breaks)
  get.rates <- function(df) sapply(df,function(col) sum(col==1)/length(col))
  rate      <- sapply(split(data[-1],cuts),get.rates)
  data.frame(t(rate))
}
# by year
disease.rates(children,breaks=c(0,12*1:5))
##              X1      X2      X3      X4      X5
## (0,12]  0.04927 0.05027 0.04916 0.05049 0.05074
## (12,24] 0.04965 0.04957 0.04970 0.05044 0.04982
## (24,36] 0.05032 0.05065 0.05044 0.05036 0.05024
## (36,48] 0.04962 0.05079 0.04984 0.04895 0.04981
## (48,60] 0.05103 0.05012 0.04922 0.04986 0.04942

# more detail in first year
disease.rates(children,breaks=c(0,1,2,4,6,12,60))
##              X1      X2      X3      X4      X5
## (0,1]   0.04780 0.04949 0.04846 0.04968 0.05198
## (1,2]   0.04891 0.04808 0.04909 0.05212 0.05236
## (2,4]   0.04943 0.04797 0.04740 0.05113 0.05110
## (4,6]   0.04980 0.05143 0.05004 0.05086 0.05189
## (6,12]  0.04935 0.05116 0.04959 0.05002 0.04977
## (12,60] 0.05016 0.05028 0.04980 0.04990 0.04982
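
Since the question mentions running the same calculation across a number of data frames, one possible way to organise this is to keep the cutpoint schemes in a named list and loop over both with `lapply`. This is only a sketch: `make_data`, `datasets`, and `schemes` are hypothetical names with simulated data, and `disease.rates` is restated in compact form (the mean of a 0/1 column equals the proportion of 1s) so the block runs on its own.

```r
# Compact restatement of disease.rates() so this block is self-contained
disease.rates <- function(data, breaks) {
  cuts <- cut(data$age, breaks)
  rate <- sapply(split(data[-1], cuts), function(df) sapply(df, mean))
  data.frame(t(rate))
}

# Hypothetical data: two small data frames with the same layout as `children`
set.seed(1)
make_data <- function(n) {
  data.frame(age = sample(1:60, n, replace = TRUE),
             X1  = rbinom(n, 1, 0.05),
             X2  = rbinom(n, 1, 0.05))
}
datasets <- list(north = make_data(1e4), south = make_data(1e4))

# Cutpoint schemes live in one place; edit or add entries and re-run
schemes <- list(
  coarse = c(0, 12, 24, 60),
  fine   = c(0, 2, 4, 6, 12, 24, 60)
)

# Nested list of results: results[[scheme]][[dataset]]
results <- lapply(schemes, function(b) lapply(datasets, disease.rates, breaks = b))

results$fine$north   # rates for one data frame at the finer granularity
```

Changing the granularity then means changing only the `schemes` list, not the calculation code.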

Thank you for putting in the time and effort to craft a thorough example. I am just getting started in R, so I apologize for the vague question and lack of data sample. Thank you, nonetheless. — mcjudd, Aug 13 '14 at 14:36
