I am working with a large health insurance dataset and I am interested in participants with certain claims codes. One of my inclusion criteria is that the participant has to have been insured for one year before and one year after the claim date. E.g., if they were injured 9/27/2017, they need insurance from 9/27/2016 to 9/27/2018.
I have tried doing a simple rowSums, and using apply, but both have the same issue: "in from:to : numerical expression has # elements: only the first used". Right now, I have the range saved as variables in the dataframe. I think I understand why I am having the issue: the : operator is expecting a single number and receiving a vector. How can I get it to conditionally select columns to sum? I will include my code below.
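To make the cause concrete, here is a minimal sketch on made-up values (the names low, high, toy, and ranges are hypothetical, not from my real data). The : operator only looks at the first element of each operand, so one range per row has to be built explicitly, for example with Map:

```r
# Hypothetical per-row column bounds (illustrative values only).
low  <- c(1, 2)
high <- c(3, 4)

# low:high would use only low[1] and high[1], i.e. 1:3 for every row.
# Map() applies `:` pairwise, giving one range per row instead.
ranges <- Map(`:`, low, high)   # list(1:3, 2:4)

# Toy 0/1 data: sum each row over its own range of columns.
toy <- data.frame(a = c(1, 0), b = c(1, 1), c = c(0, 1), d = c(1, 1))
insured <- vapply(
  seq_len(nrow(toy)),
  function(i) sum(unlist(toy[i, ranges[[i]]])),
  numeric(1)
)
insured
# 2 3
```

This per-row loop is easy to read but is likely to be slow on a large dataframe.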
In my example, I am just trying to count the number of months a participant is insured in the 6 months before and after their accident. The ins_#_# variables are simple yes/no (1/0) indicators of whether or not the participant was insured that month. Any guidance is appreciated!
library(tidyverse)
set.seed(1)
df <- data.frame(id = seq(1, 100),
                 injury_date = sample(seq(as.Date('2017/01/01'), as.Date('2017/12/31'), by = "day"), 100),
                 ins_07_16 = sample(c(0,1), replace = TRUE),
                 ins_08_16 = sample(c(0,1), replace = TRUE),
                 ins_09_16 = sample(c(0,1), replace = TRUE),
                 ins_10_16 = sample(c(0,1), replace = TRUE),
                 ins_11_16 = sample(c(0,1), replace = TRUE),
                 ins_12_16 = sample(c(0,1), replace = TRUE),
                 ins_01_17 = sample(c(0,1), replace = TRUE),
                 ins_02_17 = sample(c(0,1), replace = TRUE),
                 ins_03_17 = sample(c(0,1), replace = TRUE),
                 ins_04_17 = sample(c(0,1), replace = TRUE),
                 ins_05_17 = sample(c(0,1), replace = TRUE),
                 ins_06_17 = sample(c(0,1), replace = TRUE),
                 ins_07_17 = sample(c(0,1), replace = TRUE),
                 ins_08_17 = sample(c(0,1), replace = TRUE),
                 ins_09_17 = sample(c(0,1), replace = TRUE),
                 ins_10_17 = sample(c(0,1), replace = TRUE),
                 ins_11_17 = sample(c(0,1), replace = TRUE),
                 ins_12_17 = sample(c(0,1), replace = TRUE),
                 ins_01_18 = sample(c(0,1), replace = TRUE),
                 ins_02_18 = sample(c(0,1), replace = TRUE),
                 ins_03_18 = sample(c(0,1), replace = TRUE),
                 ins_04_18 = sample(c(0,1), replace = TRUE),
                 ins_05_18 = sample(c(0,1), replace = TRUE),
                 ins_06_18 = sample(c(0,1), replace = TRUE))
df <- df %>%
  mutate(month = as.numeric(format(as.Date(injury_date), "%m")), # pulling month of injury
         low_mo = month + 2,   # column position 6 months before the injury month
         high_mo = month + 14) # column position 6 months after the injury month
df$insured <- rowSums(df[df$low_mo:df$high_mo]) #only uses first element
df$insured <- apply(df[df$low_mo:df$high_mo], 1, sum) #only uses first element
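For reference, the computation I am after can be sketched without a per-row loop by masking a matrix of the monthly indicators. This is only an illustration on hypothetical toy data (toy, m1 to m6, ins, pos, and in_window are made-up names), assuming the bounds are column positions in the dataframe, as low_mo and high_mo are above:

```r
# Toy data: 3 participants, 6 monthly 0/1 indicators (hypothetical values).
toy <- data.frame(
  id = 1:3,
  m1 = c(1, 1, 0), m2 = c(1, 0, 1), m3 = c(1, 1, 1),
  m4 = c(0, 1, 1), m5 = c(1, 1, 0), m6 = c(1, 1, 1),
  low_mo  = c(2, 3, 4),  # first column position to include, per row
  high_mo = c(4, 5, 7)   # last column position to include, per row
)

ins <- as.matrix(toy[2:7])  # monthly indicator columns only
pos <- col(ins) + 1         # each cell's column position in `toy` (offset by the id column)

# Vectors recycle down the columns of a matrix, so row i is compared
# against its own low_mo[i] and high_mo[i].
in_window <- pos >= toy$low_mo & pos <= toy$high_mo

# Zero out cells outside each row's window, then sum across the row.
toy$insured <- rowSums(ins * in_window)
toy$insured
# 3 2 3
```

Whether this is faster than the approaches benchmarked below is not something I have measured.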
Edit: Although I did not specify that I wanted a fast solution, I am working with a lot of data so I tested which of @akrun's solutions was the fastest. I changed the dataframe so it was 1e5 (100,000) rows. The results are below in case anyone is curious.
library(microbenchmark)
microbenchmark(
  # o1: loop over rows, summing each row's own column range
  o1 <- sapply(seq_len(nrow(df)), function(i) sum(df[i, df$low_mo[i]:df$high_mo[i]])),
  # o2: pick the selected cells by (row, column) matrix indexing, then sum by row
  o2 <- {colInd <- Map(`:`, df$low_mo, df$high_mo);
    rowInd <- rep(seq_len(nrow(df)), lengths(colInd));
    as.vector(tapply(df[-(1:2)][cbind(rowInd, unlist(colInd)-2)],
                     rowInd, FUN = sum))},
  # o3: set cells outside each row's range to NA, then rowSums with na.rm
  o3 <- {colInd1 <- Map(function(x, y) which(!seq_along(df) %in% x:y), df$low_mo, df$high_mo);
    rowInd1 <- rep(seq_len(nrow(df)), lengths(colInd1));
    rowSums(replace(df, cbind(rowInd1, unlist(colInd1)), NA)[-(1:2)], na.rm = TRUE)},
  times = 5)
Unit: milliseconds
 expr        min         lq       mean     median         uq        max neval
   o1 20408.5072 20757.0285 20903.9386 20986.2275 21069.3163 21298.6137     5
   o2   433.5463   436.3066   448.6448   455.6551   456.8836   460.8325     5
   o3   470.6834   482.4449   492.9594   485.6210   504.1353   521.9122     5
> identical(o1, o2)
[1] TRUE
> identical(o2, o3)
[1] TRUE