1

I'm relatively new to R and have used stackoverflow to solve a number of problems, but this is the first time I can't find a straightforward solution. Would really appreciate help on this.

PROBLEM (simplified): I have a data frame with a field for account.id and a field for start.date. There may be multiple records per day. I want to create a third field called sequential.days which reports the number of sequential days a user (account.id) has been active.

DETAILS: The data is already sorted by account.id and then by start.date. There can be multiple records per day. If there are multiple records in a single day, I would like sequential.days to populate with a value of 1.

Thanks.

Here is a "working example" -- meaning that it includes an illustration of the data and the field I'd like to generate.

id <- c(1030, 1030, 1030, 1030, 2022, 2022, 2022, 2022, 3045, 3045, 3045, 3045)
date <- c('2013-01-01', '2013-01-01', '2013-01-02', '2013-02-04', '2013-02-01', '2013-02-02', '2013-02-02', '2013-01-04', '2013-05-01', '2013-06-01',  '2013-07-01', '2013-07-01')
sequential.days <- c(1,1,2,1,1,2,2,1,1,1,1,1)
df <- cbind(id, date, sequential.days)
Matthew Plourde
  • 43,932
  • 7
  • 96
  • 113

1 Answers1

0

here is a data.table solution:

# convert to data.table
library(data.table)
DT <- data.table(df)

# make sure `date` is in fact a date and not a string
DT[, date := as.Date(date)]

# re order
DT <- setkey(DT[order(date)], id)

# compute diffs
DT[, diffs := c(0, diff(date)), by=id]

## We will use cumsum.  Anything greater than 1, should be reset to 0
DT[diffs > 1, diffs := 0]

# add one to each value
DT[, diffs := diffs + 1]

# fix duplicate dates
DT[, diffs := max(diffs), by=list(id, date)]

RESULTS

DT

      id       date sequential.days diffs
 1: 1030 2013-01-01               1     1
 2: 1030 2013-01-01               1     1
 3: 1030 2013-01-02               2     2
 4: 1030 2013-02-04               1     1
 5: 2022 2013-01-04               1     1
 6: 2022 2013-02-01               1     1
 7: 2022 2013-02-02               2     2
 8: 2022 2013-02-02               2     2
 9: 3045 2013-05-01               1     1
10: 3045 2013-06-01               1     1
11: 3045 2013-07-01               1     1
12: 3045 2013-07-01               1     1
Ricardo Saporta
  • 54,400
  • 17
  • 144
  • 178
  • When I use this code it produces accurate results for up to two sequential days, but it doesn't work for anything more than two. In other words, the "diff" value seems to max out at 2. Any idea why or how to fix it? – user2537366 Jul 02 '13 at 18:58
  • hmm... im guessing probably need to look at the diff of diffs and see if that's greater than 1. If that doesnt work, can you please update the OP with sample data that causes this issue – Ricardo Saporta Jul 02 '13 at 19:01
  • please see my link in the comments above for how to reproduce your data quickly – Ricardo Saporta Jul 02 '13 at 19:02