3

I have a question that combines these two questions in some way (Substract date from previous row by group (using R) and subtract value from previous row by group). But cannot manage to get script to work.

I have a dataset that looks something like this:

id  |     date    |  min  
 1  |  2015-07-18 |  25
 1  |  2015-07-22 |  15
 1  |  2015-07-23 |  10
 1  |  2015-07-30 |  15
 2  |  2015-07-10 |  10
 2  |  2015-07-16 |  20
 2  |  2015-07-23 |  10

And I want to create a new column totmin that adds the total number of minutes played in the last 7 days by id:

id  |     date    |  min  |  totmin
 1  |  2015-07-18 |  25   |    25
 1  |  2015-07-22 |  15   |    40
 1  |  2015-07-23 |  10   |    50
 1  |  2015-07-30 |  15   |    25
 2  |  2015-07-10 |  10   |    10
 2  |  2015-07-16 |  20   |    30
 2  |  2015-07-23 |  10   |    30

I tried with lag but do not how to restrict to only 7 days.

Agustín Indaco
  • 550
  • 5
  • 17

2 Answers2

4

We can group_by id and sum min value for each date for 7 day period.

library(dplyr)

df %>%
  group_by(id) %>%
  mutate(totmin = purrr::map_dbl(date, ~sum(min[between(date, . - 7, .)])))

#     id  date         min totmin
#   <int> <date>     <int>  <dbl>
#1     1 2015-07-18    25     25
#2     1 2015-07-22    15     40
#3     1 2015-07-23    10     50
#4     1 2015-07-30    15     25
#5     2 2015-07-10    10     10
#6     2 2015-07-16    20     30
#7     2 2015-07-23    10     30

data

df <- structure(list(id = c(1L, 1L, 1L, 1L, 2L, 2L, 2L), date = structure(c(16634, 
16638, 16639, 16646, 16626, 16632, 16639), class = "Date"), min = c(25L, 
15L, 10L, 15L, 10L, 20L, 10L)), row.names = c(NA, -7L), class = "data.frame")
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
1

This can be done using fuzzyjoin:

library(dplyr)
df <- tribble(
    ~id, ~date, ~min,
    1, "2015-07-18", 25,
    1, "2015-07-22", 15,
    1, "2015-07-23", 10,
    1, "2015-07-30", 15,
    2, "2015-07-10", 10,
    2, "2015-07-16", 20,
    2, "2015-07-23", 10
  ) %>% 
  mutate(date = as.Date(date))

We'll join df with itself by id and date, selecting rows from the second df where id is the same, and second date is between first date and first date - 7.

library(fuzzyjoin)

df_join <-
  fuzzy_left_join(
    df, df,
    by = c("id", "date"),
    match_fun = c(
      "id" = `==`,
      "date" = function(x, y) {y <= x & y >= x - 7}
    )
  )

df_join
#> # A tibble: 13 x 6
#>     id.x date.x     min.x  id.y date.y     min.y
#>    <dbl> <date>     <dbl> <dbl> <date>     <dbl>
#>  1     1 2015-07-18    25     1 2015-07-18    25
#>  2     1 2015-07-22    15     1 2015-07-18    25
#>  3     1 2015-07-22    15     1 2015-07-22    15
#>  4     1 2015-07-23    10     1 2015-07-18    25
#>  5     1 2015-07-23    10     1 2015-07-22    15
#>  6     1 2015-07-23    10     1 2015-07-23    10
#>  7     1 2015-07-30    15     1 2015-07-23    10
#>  8     1 2015-07-30    15     1 2015-07-30    15
#>  9     2 2015-07-10    10     2 2015-07-10    10
#> 10     2 2015-07-16    20     2 2015-07-10    10
#> 11     2 2015-07-16    20     2 2015-07-16    20
#> 12     2 2015-07-23    10     2 2015-07-16    20
#> 13     2 2015-07-23    10     2 2015-07-23    10

Now we have to group by id and first date and calculate the total number of minutes.

res <- 
  df_join %>% 
  select(id = id.x, date = date.x, min.x, min.y) %>% 
  group_by(id, date) %>% 
  summarise(min = first(min.x), totmin = sum(min.y))
res
#> # A tibble: 7 x 4
#> # Groups:   id [2]
#>      id date         min totmin
#>   <dbl> <date>     <dbl>  <dbl>
#> 1     1 2015-07-18    25     25
#> 2     1 2015-07-22    15     40
#> 3     1 2015-07-23    10     50
#> 4     1 2015-07-30    15     25
#> 5     2 2015-07-10    10     10
#> 6     2 2015-07-16    20     30
#> 7     2 2015-07-23    10     30
Iaroslav Domin
  • 2,698
  • 10
  • 19