3

Lets suppose I have a data frame:

    User    Date
    aaaa    2015-11-26
    aaaa    2015-12-26
    aaaa    2016-01-26
    bbbb    2014-10-15
    bbbb    2014-11-15
    bbbb    2015-05-16

And I want to generate a new column variable(s) that:

  1. Count a the unique user in sequence
  2. If there is a gap in the month to month, count how many months the gap is.

Desired ouput:

    User    Date        Count  Gap
    aaaa    2015-11-26    1     0
    aaaa    2015-12-26    2     0
    aaaa    2016-01-26    3     0
    bbbb    2014-10-15    1     0
    bbbb    2014-11-15    2     0
    bbbb    2015-05-16    3     6
JasonAizkalns
  • 20,243
  • 8
  • 57
  • 116
Kyle Billings
  • 57
  • 2
  • 7
  • 1
    What if the sequence of User was `c("aaaa", "aaaa", "bbbb", "aaaa", "bbbb", "bbbb")`? Should the result be the same? Or does that start a new sequence for User `aaaa`? – talat Apr 06 '16 at 14:53
  • No, it the sequence was `a,a,b,a,b,b` - the count should return as `1,2,1,3,2,3` - the actual data frame has thousands of unique users and tens of thousands of observations. I am assuming that they are all mixed - although I could easily group by unique user if need be. – Kyle Billings Apr 06 '16 at 14:57
  • This looks like an option: http://stackoverflow.com/q/22287062 For the "count" thing in dplyr, just do group_by and row_number. – Frank Apr 06 '16 at 14:59
  • so my last attempt - I will worry about the month gap next - is `foo %>% group_by(user) %>% arrange(date) %>% mutate(seqnum = 1:length(user_id)` - however, and I have not seem this happen, there are new added attributes to the data frame and now I cannot seem to get rid of them... – Kyle Billings Apr 06 '16 at 15:11
  • got it. silly little problem. now to figure out (B) – Kyle Billings Apr 06 '16 at 15:15

1 Answers1

4

Using zoo::as.yearmon(), however, I had to round because otherwise 2015-11-26 to 2015-12-26 is considered longer than one month. Perhaps someone can comment/edit/explain how to make that particular calculation more "intuitive".

library(dplyr)
library(zoo)

df %>%
  group_by(User) %>%
  mutate(Count = 1:n(),
         Gap_In_Months = round(12 * as.numeric(as.yearmon(Date) - as.yearmon(lag(Date))), 1),
         Gap = ifelse(Gap_In_Months <= 1 | is.na(Gap_In_Months), 0, Gap_In_Months))

#     User       Date Count Gap_In_Months   Gap
#   (fctr)     (fctr) (int)         (dbl) (dbl)
# 1   aaaa 2015-11-26     1            NA     0
# 2   aaaa 2015-12-26     2             1     0
# 3   aaaa 2016-01-26     3             1     0
# 4   bbbb 2014-10-15     1            NA     0
# 5   bbbb 2014-11-15     2             1     0
# 6   bbbb 2015-05-16     3             6     6

Perhaps you want to be more specific as to "what is a month"? 30 days? 31 days? 28 days?

If that's the case, we can utilize lubridate:

library(lubridate)

df %>%
  group_by(User) %>%
  mutate(Count = 1:n(),
         Diff_Time = ymd(Date) - ymd(lag(Date)),
         Gap = ifelse(Diff_Time <= ddays(31) | is.na(Diff_Time), 0, as.numeric(Diff_Time, units = "days")))

#     User       Date Count Diff_Time   Gap
#   (fctr)     (fctr) (int)    (dfft) (dbl)
# 1   aaaa 2015-11-26     1   NA days     0
# 2   aaaa 2015-12-26     2   30 days     0
# 3   aaaa 2016-01-26     3   31 days     0
# 4   bbbb 2014-10-15     1   NA days     0
# 5   bbbb 2014-11-15     2   31 days     0
# 6   bbbb 2015-05-16     3  182 days   182
harre
  • 7,081
  • 2
  • 16
  • 28
JasonAizkalns
  • 20,243
  • 8
  • 57
  • 116
  • Sweet! - I will give that a run. Really, by a month I am literally meaning the increase in the month integer. The day value is superfluous for what I am working on. – Kyle Billings Apr 06 '16 at 15:37
  • That's fine, but proceed with caution -- consider `2015-11-30` to `2015-12-01` -- the way this is written, that would be considered a one-month gap which may or may not be what you intend. All the more reason to keep the intermediate calculation (i.e., `Gap_In_Months` attached to the data) – JasonAizkalns Apr 06 '16 at 15:43