0

I am conducting an analysis where I am calculating a rate based partly on a rolling sum of days. I am performing this calculation using dplyr and group_by/summary/mutate operations.

However, the increments of the rolling sum of days differs by group. Ideally, I have a measurement every 30 days for a year. However, sometimes the measurements come 60 or 90 days apart.

For example:

df <- data.frame( ID = "Subject A",
                 cumulative_days = c(30, 60, 90, 180, 270, 360),
                 rolling_percent = c(.8, .6, .6, .4, .3, .2))

I want to turn this group in something like:

result <- data.frame(ID = "Subject A",
                     month = seq(1,12),
                 rolling_percent = c(.8, .6, .6, NA, NA, .4, NA, NA, .3, NA, NA, .2))

If I am able to reach the 'result' dataframe above, my plan is to leverage the dplyr/zoo solution described here: fill in NA based on the last non-NA value for each group in R

Where I can fill in the NAs with the last non-NA observation.

In other words, I want to be able to turn N observations which cumulatively add up to a rolling sum of 360 into 12 observations. At that point, I believe I can apply the other linked solution to solve my problem.

I'm struggling to clearly describe this situation, so any advice on clarifying my problem would be appreciated.

Community
  • 1
  • 1
mcharl02
  • 128
  • 1
  • 12

3 Answers3

2
library(data.table)
dt = as.data.table(df) # or setDT to convert in place

dt[, .(ID, month = cumulative_days/30, rolling_percent)][
   CJ(ID = unique(ID), month = 1:12), on = c('ID', 'month')]
#           ID month rolling_percent
# 1: Subject A     1             0.8
# 2: Subject A     2             0.6
# 3: Subject A     3             0.6
# 4: Subject A     4              NA
# 5: Subject A     5              NA
# 6: Subject A     6             0.4
# 7: Subject A     7              NA
# 8: Subject A     8              NA
# 9: Subject A     9             0.3
#10: Subject A    10              NA
#11: Subject A    11              NA
#12: Subject A    12             0.2

# or simply make it a rolling join to achieve your desired final result
dt[, .(ID, month = cumulative_days/30, rolling_percent)][
   CJ(ID = unique(ID), month = 1:12), on = c('ID', 'month'), roll = T]
#           ID month rolling_percent
# 1: Subject A     1             0.8
# 2: Subject A     2             0.6
# 3: Subject A     3             0.6
# 4: Subject A     4             0.6
# 5: Subject A     5             0.6
# 6: Subject A     6             0.4
# 7: Subject A     7             0.4
# 8: Subject A     8             0.4
# 9: Subject A     9             0.3
#10: Subject A    10             0.3
#11: Subject A    11             0.3
#12: Subject A    12             0.2

Alternatively to the above selection of columns, you can just add a new month column:

dt[, month := cumulative_days/30][
   CJ(ID = unique(ID), month = 1:12), on = c('ID', 'month'), roll = T]
#           ID cumulative_days rolling_percent month
# 1: Subject A              30             0.8     1
# 2: Subject A              60             0.6     2
# 3: Subject A              90             0.6     3
# 4: Subject A              90             0.6     4
# 5: Subject A              90             0.6     5
# 6: Subject A             180             0.4     6
# 7: Subject A             180             0.4     7
# 8: Subject A             180             0.4     8
# 9: Subject A             270             0.3     9
#10: Subject A             270             0.3    10
#11: Subject A             270             0.3    11
#12: Subject A             360             0.2    12
eddi
  • 49,088
  • 6
  • 104
  • 155
  • Wow. Really stunning how fast this is compared to @adiana 's - much appreciated - solution, at least at the scale of the actual dataset I'm using. Applying that other function over a list of DFs was taking on the order of 10 minutes. This solution was instantaneous. Gotta find more ways to use data.table in future. – mcharl02 Apr 15 '16 at 16:52
1

This is a solution joining your data.frame with a complete one

library(dplyr)
df$month<-df$cumulative_days/30
result<-data.frame(ID = "Subject A",month=seq(1,max(df$month))) %>% left_join(df) %>%
select(-cumulative_days)

If you want to apply your solution to different IDs, like in this fake dataset:

df <- data.frame( ID = "Subject A",
              cumulative_days = c(30, 60, 90, 180, 270, 360),
              rolling_percent = c(.8, .6, .6, .4, .3, .2))

df2 <- data.frame( ID = "Subject B",
              cumulative_days = c(30, 90, 120, 180, 270, 360),
              rolling_percent = c(.6, .4, .3, .2, .1, .6))

df<-rbind(df,df2)

you can declare the previous code as a function, then splitting the big dataframe based on the IDs and apply the function singularly, binding all together at the end. So the code would be like:

buildDf<-function(df){
 df$month<-df$cumulative_days/30
 data.frame(ID = df$ID[1],month=seq(1,max(df$month))) %>% 
 left_join(df) %>% select(-cumulative_days)
}

listDf<-split(df,f=df$ID)
listDfFiltered<-lapply(listDf,buildDf)
result<-do.call('rbind',listDfFiltered)

Hope this helps

adaien
  • 1,932
  • 1
  • 12
  • 26
  • Thank you! This does exactly what I would need for the example I provided. Can you suggest a way to extend this for multiple subjects, i.e. where df may contain "Subject A", "Subject B", etc etc. – mcharl02 Apr 14 '16 at 22:36
  • `dfA <- data.frame( ID = "Subject A", cumulative_days = c(30, 60, 90, 180, 270, 360), rolling_percent = c(.8, .6, .6, .4, .3, .2)); dfB <- data.frame( ID = "Subject B", cumulative_days = c(30, 90, 180, 360), rolling_percent = c(.5, .6, .6, .4)); df_all <- rbind(dfA, dfB)` – mcharl02 Apr 15 '16 at 13:48
1

We can do this with base R. Create the 'month' column by dividing by 30. Then, use expand.grid to get a data.frame with all the combinations of 'ID' and the range of 'month,merge` with the original dataset so that we get NA for 'rolling_percent' for 'ID', 'month' combination that is not found in the 'df'.

df$month <-df$cumulative_days/30
merge(expand.grid(ID = unique(df$ID), 
       month=Reduce(`:`, range(df$month))), df[-2], all.x=TRUE)
#          ID month rolling_percent
#1  Subject A     1             0.8
#2  Subject A     2             0.6
#3  Subject A     3             0.6
#4  Subject A     4              NA
#5  Subject A     5              NA
#6  Subject A     6             0.4
#7  Subject A     7              NA
#8  Subject A     8              NA
#9  Subject A     9             0.3
#10 Subject A    10              NA
#11 Subject A    11              NA
#12 Subject A    12             0.2
akrun
  • 874,273
  • 37
  • 540
  • 662