I have a large dataset. For each user in the first column, I want to compute an `interval` column as the difference between the next row's `start_time` and the current row's `start_time` (next minus current). The last row of each user can be 0 or NA, because that user has no later record.

    df <- read.table(text = "
    user   start_time
    1       4
    1       6
    1       9
    1       11
    2       1
    2       3
    2       6
    3       4
    3       4", header = TRUE)

Expected result:

    user   start_time  interval
    1       4          2    <-- (6 - 4)
    1       6          3    <-- (9 - 6)
    1       9          2    <-- (11 - 9)
    1       11         NA   <-- (or 0) because it is the last row of the user
    2       1          2
    2       3          3
    2       6          NA   <-- (or 0)
    3       4          0
    3       4          NA   <-- (or 0)

I would prefer something fast, such as dplyr's `group_by`. How can I do this in R?

Cina
  • check out `diff`? – chinsoon12 Feb 04 '20 at 03:14
  • `df %>% group_by(user) %>% mutate(interval = lead(start_time) - start_time)` If you replace `lag` in the marked answer with `lead`, all of them would work. For `shift`, use `type = "lead"` (default is `"lag"`). If you use `diff`, do `c(diff(x), 0)` instead of `c(0, diff(x))` – Ronak Shah Feb 04 '20 at 03:26
  • Use `data.table`: `setDT(df)[, diff := shift(start_time, type='lead')-start_time, .(user)]` – PKumar Feb 04 '20 at 03:30
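
The `diff` idea from the comments can be sketched in base R without extra packages. This is only a minimal sketch: it uses the `start_time` values shown in the expected result, assumes the rows are already sorted by `start_time` within each user, and `next_gap` is a hypothetical helper name, not something from the question.

```r
# Sample data, using the start_time values from the expected result
df <- read.table(text = "
user start_time
1 4
1 6
1 9
1 11
2 1
2 3
2 6
3 4
3 4", header = TRUE)

# Hypothetical helper: next value minus current value within one user.
# The last row has no successor, so NA is appended.
next_gap <- function(x) c(diff(x), NA)

# ave() applies the helper per user and returns the values in row order
df$interval <- ave(df$start_time, df$user, FUN = next_gap)

print(df)
```

To get 0 instead of NA on the last row of each user, append `0` rather than `NA` in the helper. On a very large table, the grouped `lead()` or `shift(type = "lead")` one-liners from the comments may well be faster, though that would need to be measured on the real data.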
