If I understood you correctly, here is a solution using the data.table package. I found two options (the first one has better performance).
Convert the original data frame into a data.table object:
library(data.table)                    # Load the package used by all snippets below
dt <- data.table(df)                   # Create a data.table from the data frame
setorder(dt, patient_id, visit_date)   # Sort by patient_id, then by visit_date
Define the week threshold parameter:
weekNum <- 20L   # Threshold of 20 weeks
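Both options below rely on difftime() with units = "weeks" to measure the gap between two visits. For example, between patient 1's first and last visits in the sample data:
difftime(as.Date("2017-02-15"), as.Date("2016-12-02"), units = "weeks")
# Time difference of 10.71429 weeks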
OPTION 1: Computing the number of weeks directly from the visit_date column
We define the following function, which performs the calculation for each group:
visitFreq <- function(x) {
  n <- length(x)
  result <- numeric(n)
  if (n > 1) {
    for (i in 1:n) {           # For each visit of the patient
      ref <- x[i]              # Reference date
      x.prev <- x[x < ref]     # Select the previous dates
      if (length(x.prev) > 0) {
        x.prev <- sapply(x.prev, function(y) {
          ifelse(difftime(ref, y, units = "weeks") <= weekNum, 1, 0)
        })
        result[i] <- sum(x.prev)
      }
    }
  }
  return(result)
}
For each x[i], it selects the previous dates and checks whether each of them falls within the defined threshold; counting the dates that do gives the number of previous visits within the threshold.
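For instance, applied to the sorted dates of patient 1 from the sample data, it returns one count per visit (matching the RESULT section below):
x <- as.Date(c("2016-12-02", "2016-12-15", "2016-12-30", "2017-02-15"))
visitFreq(x)
# [1] 0 1 2 3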
Once we know how to make the calculation, we just need to apply this function to the visit_date column for each patient_id:
dt[, visits := visitFreq(visit_date), by = patient_id]
Note: The function visitFreq has to be a vectorized function: it receives a vector of visit_date values and must return a vector of the same length.
OPTION 2: Creating an artificial variable that collects all visit_date values for a given patient.
Now we need a function that performs the calculation in terms of the number of weeks:
calc <- function(vec, x) {
  vec.prev <- vec[vec < x]   # Select all dates before x
  n <- 0
  if (length(vec.prev) > 0) {
    vec.prev <- sapply(vec.prev, function(y) {
      ifelse(difftime(x, y, units = "weeks") <= weekNum, 1, 0)
    })
    n <- sum(vec.prev)
  }
  return(n)
}
where:
- vec: a vector of dates
- x: the reference date
We keep only the dates previous to the date x. Then we apply sapply to each element y of vec.prev, computing the time difference between y and the reference date x, using weeks as the unit. The result is 1 for any difference of at most weekNum weeks, and 0 otherwise. The number of previous visits within the given number of weeks of the reference date is then just the count of all the 1s we get.
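For example, with patient 1's dates and 2016-12-30 as the reference date, the two earlier visits both fall within 20 weeks:
vec <- as.Date(c("2016-12-02", "2016-12-15", "2016-12-30", "2017-02-15"))
calc(vec, as.Date("2016-12-30"))
# [1] 2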
Now we use this function on the data.table object like this:
dt[, visits := .(list(visit_date)), by = patient_id][
  , visits := mapply(calc, visits, visit_date)][order(patient_id)][]
Let's explain it a little bit:
- We create a visits variable that is a list of all the dates for a given patient_id (because of the by clause).
If we execute the first expression, it produces something like this:
> dt[, visits := .(list(visit_date)), by = patient_id][]
visit_id patient_id visit_date visits
1: 1 1 2016-12-02 2016-12-02,2016-12-15,2016-12-30,2017-02-15
2: 4 1 2016-12-15 2016-12-02,2016-12-15,2016-12-30,2017-02-15
3: 3 1 2016-12-30 2016-12-02,2016-12-15,2016-12-30,2017-02-15
4: 7 1 2017-02-15 2016-12-02,2016-12-15,2016-12-30,2017-02-15
5: 2 2 2016-12-02 2016-12-02,2017-02-01
6: 6 2 2017-02-01 2016-12-02,2017-02-01
7: 5 3 2016-12-30 2016-12-30
8: 8 4 2017-02-10 2017-02-10
9: 9 5 2017-01-15 2017-01-15
10: 10 6 2017-03-01 2017-03-01
>
- The second statement (the second []-block) just does the calculation, re-assigning the previously created visits variable, but now counting the number of previous visits with respect to the reference date. We need the mapply function for the vectorized computation: on each invocation of the calc function, the input arguments are dt$visits[[i]] (a list of dates) and the corresponding dt$visit_date[i]; mapply just iterates over all i-elements, invoking the function calc.
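Conceptually, that mapply call boils down to the following row-wise loop (an illustrative sketch only, not part of the solution):
sapply(seq_len(nrow(dt)), function(i) calc(dt$visits[[i]], dt$visit_date[i]))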
RESULT
Finally, the result will be:
> dt
visit_id patient_id visit_date visits
1: 1 1 2016-12-02 0
2: 4 1 2016-12-15 1
3: 3 1 2016-12-30 2
4: 7 1 2017-02-15 3
5: 2 2 2016-12-02 0
6: 6 2 2017-02-01 1
7: 5 3 2016-12-30 0
8: 8 4 2017-02-10 0
9: 9 5 2017-01-15 0
10: 10 6 2017-03-01 0
>
and I guess this is what you wanted.
Note: There is probably a way to get the calculation done on the fly, but I was not able to see how. Perhaps other folks can suggest a slightly more syntactically succinct way.
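For what it is worth, one candidate (a sketch only, essentially OPTION 1 inlined without the helper function, checked only against the sample data) could be:
dt[, visits := sapply(seq_len(.N), function(i)
       sum(visit_date < visit_date[i] &
           difftime(visit_date[i], visit_date, units = "weeks") <= weekNum)),
   by = patient_id]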
PERFORMANCE
I was wondering which option has the better performance (I expected OP1 to win), so let's check:
library(microbenchmark)
op <- microbenchmark(
OP1 = copy(dt)[, visits := visitFreq(visit_date), by = patient_id],
OP2 = copy(dt)[, visits := .(list(visit_date)), by = patient_id][, visits := mapply(calc, visits, visit_date)],
times=100L)
print(op)
It produces the following output:
Unit: milliseconds
expr min lq mean median uq max neval cld
OP1 3.467451 3.552916 4.165517 3.642150 4.200413 7.96348 100 a
OP2 4.732729 4.832695 5.799648 5.063985 6.073467 13.17264 100 b
>
Therefore the first option has the better performance.
EDIT (added the solution proposed by @DavidArenburg)
Let's include the join solution as a third option, but let's also increase the input size by repeating the input vectors, for example:
nSample <- 100
patient_id <- rep(c(1, 2, 1, 1, 3, 2, 1, 4, 5, 6), nSample)
visit_id <- 1:(10 * nSample)   # One unique id per row (patient_id and visit_date have 10 * nSample elements)
visit_date <- rep(as.Date(c('2016-12-02', '2016-12-02', '2016-12-30',
'2016-12-15', '2016-12-30', '2017-02-01',
'2017-02-15', '2017-02-10', '2017-01-15', '2017-03-01')), nSample)
df <- data.frame(visit_id, patient_id, visit_date)
opc3 <- function(df) {
  df[, visit_date20 := visit_date - 20 * 7]   # Create a 20-week boundary (20 * 7 days)
  ## Count the previous visits within the range via a self non-equi join:
  ## for each reference row (by = .EACHI), .N counts the rows of the same patient
  ## whose visit_date is strictly earlier but later than the 20-week boundary.
  df[df, .(visits = .N),
     on = .(patient_id, visit_date < visit_date, visit_date > visit_date20),
     by = .EACHI]
}
dt <- data.table(df)
dt3 <- copy(dt)[, visit_date := as.IDate(visit_date)] # Convert visit_date to data.table's IDate class
library(microbenchmark)
op <- microbenchmark(
OP1 = copy(dt)[, visits := visitFreq(visit_date), by = patient_id],
OP2 = copy(dt)[, visits := .(list(visit_date)), by = patient_id][, visits := mapply(calc, visits, visit_date)],
OP3 = opc3(copy(dt3)),
times = 10L)
print(op)
I get the following results:
Unit: milliseconds
expr min lq mean median uq max neval cld
OP1 6315.73724 6485.111937 10744.808669 11789.230998 15062.957734 15691.445961 10 b
OP2 6266.80130 6431.330087 11074.441187 11773.459887 13928.861934 15335.733525 10 b
OP3 2.38427 2.845334 5.157246 5.383949 6.711482 8.596792 10 a
>
The @DavidArenburg solution scales much better when the number of rows increases.