If I understood you correctly, here is a solution using the data.table package. I found two options (the first one has better performance).
Convert the original data frame into a data.table object:
library(data.table)                    # Load the package used by all snippets below
dt <- data.table(df)                   # Create a data.table from the data frame
setorder(dt, patient_id, visit_date)   # Sort by patient_id, then by visit_date
Define the week threshold parameter:
weekNum <- 20L   # Threshold of 20 weeks
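Both options below rely on difftime() with units = "weeks" to measure the gap between two visits. For example, between patient 1's first and last visits in the sample data:
difftime(as.Date("2017-02-15"), as.Date("2016-12-02"), units = "weeks")
# Time difference of 10.71429 weeks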
OPTION 1: Computing the number of weeks directly from the visit_date column
We define the following function, which performs the calculation for each group:
visitFreq <- function(x) {
  n <- length(x)
  result <- numeric(n)
  if (n > 1) {
    for (i in 1:n) {           # For each visit of the patient
      ref <- x[i]              # Reference date
      x.prev <- x[x < ref]     # Select the previous dates
      if (length(x.prev) > 0) {
        x.prev <- sapply(x.prev, function(y) {
          ifelse(difftime(ref, y, units = "weeks") <= weekNum, 1, 0)
        })
        result[i] <- sum(x.prev)
      }
    }
  }
  return(result)
}
For each x[i], it selects the previous dates and checks whether each of them falls within the defined threshold; counting the dates that do gives the number of previous visits within the threshold.
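For instance, applied to the sorted dates of patient 1 from the sample data, it returns one count per visit (matching the RESULT section below):
x <- as.Date(c("2016-12-02", "2016-12-15", "2016-12-30", "2017-02-15"))
visitFreq(x)
# [1] 0 1 2 3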
Once we know how to make the calculation, we just need to apply this function to the visit_date column for each patient_id:
dt[, visits := visitFreq(visit_date), by = patient_id]
Note: The function visitFreq has to be a vectorized function: it receives a vector of visit_date values and must return a vector of the same length.
OPTION 2: Creating an artificial variable that collects all visit_date values for a given patient.
Now we need a function that performs the calculation in terms of the number of weeks:
calc <- function(vec, x) {
  vec.prev <- vec[vec < x]   # Select all dates before x
  n <- 0
  if (length(vec.prev) > 0) {
    vec.prev <- sapply(vec.prev, function(y) {
      ifelse(difftime(x, y, units = "weeks") <= weekNum, 1, 0)
    })
    n <- sum(vec.prev)
  }
  return(n)
}
where:
- vec: a vector of dates
- x: the reference date
We keep only the dates previous to the date x. Then we apply sapply to each element y of vec.prev, computing the time difference between y and the reference date x, using weeks as the unit. The result is 1 for any difference of at most weekNum weeks, and 0 otherwise. The number of previous visits within the given number of weeks of the reference date is then just the count of all the 1s we get.
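For example, with patient 1's dates and 2016-12-30 as the reference date, the two earlier visits both fall within 20 weeks:
vec <- as.Date(c("2016-12-02", "2016-12-15", "2016-12-30", "2017-02-15"))
calc(vec, as.Date("2016-12-30"))
# [1] 2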
Now we use this function on the data.table object like this:
dt[, visits := .(list(visit_date)), by = patient_id][
  , visits := mapply(calc, visits, visit_date)][order(patient_id)][]
Let's explain it a little bit:
- We create a visits variable that is a list of all the dates for a given patient_id (because of the by clause).
If we execute the first expression, it produces something like this:
> dt[, visits := .(list(visit_date)), by = patient_id][]
visit_id patient_id visit_date visits
1: 1 1 2016-12-02 2016-12-02,2016-12-15,2016-12-30,2017-02-15
2: 4 1 2016-12-15 2016-12-02,2016-12-15,2016-12-30,2017-02-15
3: 3 1 2016-12-30 2016-12-02,2016-12-15,2016-12-30,2017-02-15
4: 7 1 2017-02-15 2016-12-02,2016-12-15,2016-12-30,2017-02-15
5: 2 2 2016-12-02 2016-12-02,2017-02-01
6: 6 2 2017-02-01 2016-12-02,2017-02-01
7: 5 3 2016-12-30 2016-12-30
8: 8 4 2017-02-10 2017-02-10
9: 9 5 2017-01-15 2017-01-15
10: 10 6 2017-03-01 2017-03-01
>
- The second statement (the second []-block) just does the calculation, re-assigning the previously created visits variable, but now counting the number of previous visits with respect to the reference date. We need the mapply function for the vectorized computation: on each invocation of the calc function, the input arguments are dt$visits[[i]] (a list of dates) and the corresponding dt$visit_date[i]; mapply just iterates over all i-elements, invoking the function calc.
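Conceptually, that mapply call boils down to the following row-wise loop (an illustrative sketch only, not part of the solution):
sapply(seq_len(nrow(dt)), function(i) calc(dt$visits[[i]], dt$visit_date[i]))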
RESULT
Finally, the result will be:
> dt
visit_id patient_id visit_date visits
1: 1 1 2016-12-02 0
2: 4 1 2016-12-15 1
3: 3 1 2016-12-30 2
4: 7 1 2017-02-15 3
5: 2 2 2016-12-02 0
6: 6 2 2017-02-01 1
7: 5 3 2016-12-30 0
8: 8 4 2017-02-10 0
9: 9 5 2017-01-15 0
10: 10 6 2017-03-01 0
>
and I guess this is what you wanted.
Note: There is probably a way to get the calculation done on the fly, but I was not able to see how. Perhaps other folks can suggest a slightly more syntactically succinct way.
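For what it is worth, one candidate (a sketch only, essentially OPTION 1 inlined without the helper function, checked only against the sample data) could be:
dt[, visits := sapply(seq_len(.N), function(i)
       sum(visit_date < visit_date[i] &
           difftime(visit_date[i], visit_date, units = "weeks") <= weekNum)),
   by = patient_id]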
PERFORMANCE
I was wondering which option has the better performance (I expected OP1 to win), so let's check:
library(microbenchmark)
op <- microbenchmark(
OP1 = copy(dt)[, visits := visitFreq(visit_date), by = patient_id],
OP2 = copy(dt)[, visits := .(list(visit_date)), by = patient_id][, visits := mapply(calc, visits, visit_date)],
times=100L)
print(op)
It produces the following output:
Unit: milliseconds
expr min lq mean median uq max neval cld
OP1 3.467451 3.552916 4.165517 3.642150 4.200413 7.96348 100 a
OP2 4.732729 4.832695 5.799648 5.063985 6.073467 13.17264 100 b
>
Therefore the first option has the better performance.
EDIT (added the solution proposed by @DavidArenburg)
Let's include the join solution as a third option, but let's also increase the input size by repeating the input vectors, for example:
nSample <- 100
patient_id <- rep(c(1, 2, 1, 1, 3, 2, 1, 4, 5, 6), nSample)
visit_id <- 1:(10 * nSample)   # One unique id per row (patient_id and visit_date have 10 * nSample elements)
visit_date <- rep(as.Date(c('2016-12-02', '2016-12-02', '2016-12-30',
'2016-12-15', '2016-12-30', '2017-02-01',
'2017-02-15', '2017-02-10', '2017-01-15', '2017-03-01')), nSample)
df <- data.frame(visit_id, patient_id, visit_date)
opc3 <- function(df) {
  df[, visit_date20 := visit_date - 20 * 7]   # Create a 20-week boundary (20 * 7 days)
  ## Count the previous visits within the range via a self non-equi join:
  ## for each reference row (by = .EACHI), .N counts the rows of the same patient
  ## whose visit_date is strictly earlier but later than the 20-week boundary.
  df[df, .(visits = .N),
     on = .(patient_id, visit_date < visit_date, visit_date > visit_date20),
     by = .EACHI]
}
dt <- data.table(df)
dt3 <- copy(dt)[, visit_date := as.IDate(visit_date)] # Convert visit_date to data.table's IDate class
library(microbenchmark)
op <- microbenchmark(
OP1 = copy(dt)[, visits := visitFreq(visit_date), by = patient_id],
OP2 = copy(dt)[, visits := .(list(visit_date)), by = patient_id][, visits := mapply(calc, visits, visit_date)],
OP3 = opc3(copy(dt3)),
times = 10L)
print(op)
I get the following results:
Unit: milliseconds
expr min lq mean median uq max neval cld
OP1 6315.73724 6485.111937 10744.808669 11789.230998 15062.957734 15691.445961 10 b
OP2 6266.80130 6431.330087 11074.441187 11773.459887 13928.861934 15335.733525 10 b
OP3 2.38427 2.845334 5.157246 5.383949 6.711482 8.596792 10 a
>
The @DavidArenburg solution scales much better when the number of rows increases.