3

To make this question more generalized, I believe it could also be rephrased as: Creating a rolling temporally sensitive factor variable. Though an uncommon requirement, this could be utilized for many different data sources.

I have a series of non-uniform time data with > 1 record per day for thousands of users. I want to create a new column player_type that keeps track of a rolling 30 day definition of their behavior. The behavior is defined by what games they play; the column 'games' is a factor of gameA, gameB.

There are thus three types of behaviors:

  1. Exclusively plays GameA - 'A'
  2. Exclusively plays GameB - 'B'
  3. Plays both games - 'Hybrid'

I want to use this new column to see the changes in their play behavior over time, as well as counting the number of players in each group throughout time, to see how they change.

The time series is highly irregular for each player. Players can play multiple types of games per day, or not play any games for many months. The time series is irregular per player such that a record is only created when the player plays a game, thus I expect a solution might use a filter something like:

interval(current_date, current_date - new_period(days=30) (using lubridate).

Here is an example data set. Keep in mind this it is simplified and tests a rolling 1 day change, so simple methods checking the record before will not actually work. If you are able to make a better data set, please advise and I will edit this post.

p <- c( 1,   1,   1,   2,   2,   2,   6,   6,   6)

g <- c('A', 'B', 'B', 'A', 'B', 'A', 'A', 'B', 'B')

d <- seq(as.Date('2014-10-01'), as.Date('2014-10-9'), by=1)

df <- data.frame(player_id = p, date = d, games = g)

As output I require:

 player_id       date games   type
1         1 2014-10-01     A      A (OR NA)
2         1 2014-10-02     B Hybrid
3         1 2014-10-03     B      B
4         2 2014-10-04     A      A (OR NA)
5         2 2014-10-05     B Hybrid
6         2 2014-10-06     A Hybrid
7         6 2014-10-07     A      A (OR NA)
8         6 2014-10-08     B Hybrid
9         6 2014-10-09     B      B

The solution should be something like, apply through the columns, and apply a function which checks back 30 days in time, and an ifelse() statement to see what games they played.

This is a very similar post - and should help solve this problem. How do I do a conditional sum which only looks between certain date criteria

I have also explored, rowwise() and conditional mutates() using dplyr, however the catch is the historical time component for me.

Thanks for all the help! I can't thank this forum enough. I'll be checking back frequently.

Community
  • 1
  • 1
Ryan Kelly
  • 33
  • 4
  • In your real data, do you need to check for each row, all of the previous 30 rows or do you need to check the previous row (as in this example) but the delta between rows will always be 30 days? – talat Nov 29 '14 at 21:20
  • I need to check the previous 30 days (by player). This could be many rows, or less then 30 rows. The delta between rows is not consistent. _They could play both games on the same day, or not play either game for months at a time._ – Ryan Kelly Nov 29 '14 at 21:37

1 Answers1

4

Assuming that I understood it right, here's a data.table way using foverlaps() function.

Create dt and set key as shown below:

dt <- data.table(player_id = p, games = g, date = d, end_date = d)
setkey(dt, player_id, date, end_date)

hybrid_index <- function(dt, roll_days) {
    ivals = copy(dt)[, date := date-roll_days]
    olaps = foverlaps(ivals, dt, type="any", which=TRUE)
    olaps[, val := dt$games[xid] != dt$games[yid]]
    olaps[, any(val), by=xid][(V1), xid]
}

We create a dummy data.table ivals (for intervals), and for each row, we specify the start and the end dates. Note that by specifying end_date identical as dt$end_date, we'll definitely have one match (and this is deliberate) - this'll give you the non-NA version you ask for.

[With some minor changes here, you can get the NA version, but I'll leave that to you (assuming this answer is right).]

With that we simply find which ranges from ivals overlaps with dt, for each player_id. We get the matching indices. From there it's straightforward. If a player's game is non-homogeneous, then we return the corresponding index of dt from hybrid_index. And we replace those indices with "hybrid".

# roll days = 1L
dt[, type := games][hybrid_index(dt, 1L), type := "hybrid"]
#    player_id games       date   end_date   type
# 1:         1     A 2014-10-01 2014-10-01      A
# 2:         1     B 2014-10-02 2014-10-02 hybrid
# 3:         1     B 2014-10-03 2014-10-03      B
# 4:         2     A 2014-10-04 2014-10-04      A
# 5:         2     B 2014-10-05 2014-10-05 hybrid
# 6:         2     A 2014-10-06 2014-10-06 hybrid
# 7:         6     A 2014-10-07 2014-10-07      A
# 8:         6     B 2014-10-08 2014-10-08 hybrid
# 9:         6     B 2014-10-09 2014-10-09      B

# roll days = 2L
dt[, type := games][hybrid_index(dt, 2L), type := "hybrid"]
#    player_id games       date   end_date   type
# 1:         1     A 2014-10-01 2014-10-01      A
# 2:         1     B 2014-10-02 2014-10-02 hybrid
# 3:         1     B 2014-10-03 2014-10-03 hybrid
# 4:         2     A 2014-10-04 2014-10-04      A
# 5:         2     B 2014-10-05 2014-10-05 hybrid
# 6:         2     A 2014-10-06 2014-10-06 hybrid
# 7:         6     A 2014-10-07 2014-10-07      A
# 8:         6     B 2014-10-08 2014-10-08 hybrid
# 9:         6     B 2014-10-09 2014-10-09 hybrid

To illustrate the idea clearly, I've created a function and copied dt inside the function. But you can avoid that and add the dates in ivals directly to dt and make use of by.x and by.y arguments in foverlaps(). Please look at ?foverlaps.

Arun
  • 116,683
  • 26
  • 284
  • 387