6

This is similar to this dplyr lag post, and this dplyr mutate lag post, but neither of those ask this question about defaulting to the input value. I am using dplyr to mutate a new field that's a lagged offset of another field (that I've converted to POSIXct). The goal is, for a given ip, I'd like to know some summary statistics on the delta between all the times it shows up on my list. I also have about 12 million rows.

The data look like this (prior to mutation)

ip             hour         snap
192.168.1.2    2017070700    0
192.168.1.2    2017070700   15
192.168.1.4    2017070700    0
192.168.1.4    2017070701   45
192.168.1.4    2017070702   30
192.168.1.7    2017070700   15

'hour' is an integer, but should be a timestamp.

'snap' is one of 4 'snapshot' values that represent 15 minute increments.

Here's the data.frame creation code:

test <- data.frame(ip=c("192.168.1.2","192.168.1.2","192.168.1.4","192.168.1.4","192.168.1.4","192.168.1.7"), hour=c(2017070700,2017070700,2017070700,2017070701,2017070702,2017070700), snap=c(0,15,0,45,30,15))

There are hundreds and sometimes thousands of timestamps per ip. The code below uses dplyr to

  • a) pad the 0's with a leading 0,
  • b) concat the two integer 'date' fields into one field,
  • c) convert merged integer 'date' field to a POSIX date,
  • d) group by ip,
  • e) mutate a new column that's lagged of the old timestamp by 1 and, if the value is NA, refer back to the original value (THIS IS THE BIT THAT DOESN"T WORK), and
  • f) mutate a new column that takes the difference of the current time and the previous time (by ip).

These steps refer to the comments at the end of each line.

timedelta <- test %>% 
  mutate(snap = formatC(snap, width=2, flag=0)) %>%                      # a) 
  mutate(fulldateint = paste(hour, snap, sep="")) %>%                    # b) 
  mutate(fulldate = as.POSIXct(strptime(fulldateint, "%Y%m%d%H%M"))) %>% # c) 
  group_by(ip) %>%                                                       # d) 
  mutate(shifted = dplyr::lag(fulldate, default=fulldate)) %>%           # e) 
  mutate(diff = fulldate-shifted)                                        # f) 

After mutation, the data should look like this:

           ip       hour  snap  fulldateint            fulldate             shifted      diff
       <fctr>      <dbl> <chr>        <chr>              <dttm>              <dttm>    <time>
1 192.168.1.2 2017070700    00 201707070000 2017-07-07 00:00:00 2017-07-07 00:00:00    0 secs
2 192.168.1.2 2017070700    15 201707070015 2017-07-07 00:15:00 2017-07-07 00:00:00  900 secs
3 192.168.1.4 2017070700    00 201707070000 2017-07-07 00:00:00 2017-07-07 00:00:00    0 secs
4 192.168.1.4 2017070701    45 201707070145 2017-07-07 01:45:00 2017-07-07 00:00:00 6300 secs
5 192.168.1.4 2017070702    30 201707070230 2017-07-07 02:30:00 2017-07-07 01:45:00 2700 secs
6 192.168.1.7 2017070700    15 201707070015 2017-07-07 00:15:00 2017-07-07 00:15:00    0 secs

And if I could get lag to default to its original value, the 'delta-T' would always be 0 when it doesn't have a previous value (which is the desired result).

However, dplyr::lag(fulldate, default=fulldate) throws the error

Error in mutate_impl(.data, dots) : 
Column `shifted` must be length 2 (the group size) or one, not 3

It does work if I use fulldate1, but then I lose the group_by(ip) result, which is necessary. Is it possible to make lag reference its own input within dplyr?

Note: I really would prefer an answer using dplyr and not data.table, if possible, since I've been using dplyr as our primary data munging library, but also since I'd like to suggest to Mr. Wickham that he take this under consideration if it truly has no solution in the existing dplyr library.

Brian
  • 7,900
  • 1
  • 27
  • 41
TheProletariat
  • 916
  • 2
  • 11
  • 23
  • 2
    Judging by your title, you could ask this in a much shorter manner... I guess you want `dplyr::lag(fulldate, default = first(fulldate))`? – Frank Aug 18 '17 at 21:03
  • Point taken. Does first work alongside group_by()? – TheProletariat Aug 18 '17 at 21:41
  • 1
    Ok. For future reference, I meant that unless your question is about formatC, as.POSIXct, et al, you can do those data tweaks beforehand and then use `dput`; see https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/28481250#28481250 if you're not familiar with it. Btw, I think the `default` argument needs a single value, but instead you gave it a vector -- I think you missed this point, so I'm just trying to clarify it. Yeah, `first` inside `mutate` after `group_by` will work on each group separately. – Frank Aug 18 '17 at 21:56
  • Thanks for the tips. Very good stuff. And yes, if first() works on each group separately, then this is exactly what I was looking for. Can you please repost as an answer so I can accept? Also, would you suggest truncating the novel to a more readable format for posterity? – TheProletariat Aug 18 '17 at 22:00
  • 1
    Ok, I'll post. Nah, no need to edit it, I think; just useful to keep in mind next time. The title's good, so folks will hopefully be able to find it and opt to skip the novel to read the answers. – Frank Aug 18 '17 at 22:01
  • 1
    Also, I should mention that I had tried dplyr::lag(fulldate, default=fulldate[1])) for the reasons you stated but it didn't work because it doesn't select the first element per group, just the first element of the entire column. – TheProletariat Aug 18 '17 at 22:02
  • Oh, that's interesting and unexpected. – Frank Aug 18 '17 at 22:04

3 Answers3

8

In the OP's code ...

...
d) group_by(ip) %>%
e) mutate(shifted = dplyr::lag(fulldate, default=fulldate)) %>%
...

The default= argument should have a length of one. Replacing the OP's code with default = first(fulldate) should work in this case (since the first element won't have a lag and so is where we need to apply the default value).

Related cases:

  • Similarly, with a "lead", we'd want dplyr::lead(x, default=last(x)).
  • With a lag or lead of more than one step (n greater than 1), default= cannot do it and we'd probably need to switch to if_else or case_when or similar. (I'm not sure about the current tidyverse idiom.)
Frank
  • 66,179
  • 8
  • 96
  • 180
3

I think Frank's solution works pretty well. Here is the complete example:


library(dplyr, warn.conflicts = F)

test <- data.frame(ip=c("192.168.1.2","192.168.1.2","192.168.1.4","192.168.1.4","192.168.1.4","192.168.1.7"),
                   hour=c(2017070700,2017070700,2017070700,2017070701,2017070702,2017070700),
                   snap=c(0,15,0,45,30,15))


test %>%
  mutate(snap = formatC(snap, width = 2, flag = 0)) %>%
  mutate(fulldateint = paste(hour, snap, sep = "")) %>%
  mutate(fulldate = as.POSIXct(strptime(fulldateint, "%Y%m%d%H%M"))) %>%
  group_by(ip) %>%
  mutate(shifted = lag(fulldate, default = first(fulldate))) %>%
  mutate(diff = fulldate - shifted) %>% 
  ungroup() %>% 
  select(ip, fulldate, shifted, diff)

#> # A tibble: 6 x 4
#>            ip            fulldate             shifted      diff
#>        <fctr>              <dttm>              <dttm>    <time>
#> 1 192.168.1.2 2017-07-07 00:00:00 2017-07-07 00:00:00    0 secs
#> 2 192.168.1.2 2017-07-07 00:15:00 2017-07-07 00:00:00  900 secs
#> 3 192.168.1.4 2017-07-07 00:00:00 2017-07-07 00:00:00    0 secs
#> 4 192.168.1.4 2017-07-07 01:45:00 2017-07-07 00:00:00 6300 secs
#> 5 192.168.1.4 2017-07-07 02:30:00 2017-07-07 01:45:00 2700 secs
#> 6 192.168.1.7 2017-07-07 00:15:00 2017-07-07 00:15:00    0 secs
LVG77
  • 196
  • 6
1

How about

ifelse(is.na(lag(value)), value, lag(value))
mikeck
  • 3,534
  • 1
  • 26
  • 39
  • I should have mentioned that I tried some conditional NA replacement but dplyr is apparently cranky about conditional statements, as in the first link I posted above... regardless, your idea works, but it somehow converts back to a double, which is very bizarre. I used this syntax: mutate(shifted = ifelse(is.na(lag(fulldate)), fulldate, lag(fulldate))), but instead of shifted values like this (2017-07-07 00:00:00), it gave me values like this: (1499407200). I don't think I made a syntax or naming error, but happy for a 2nd(hundred) set of eyes. – TheProletariat Aug 18 '17 at 20:43
  • 1
    @TheProletariat if_else() is an dplyr alternative to ifelse() that maintains format – MBorg Mar 30 '21 at 07:34