57

I want to get the lead() and lag() values within each group, but I am getting wrong results.

For example, data is like this:

library(dplyr)
df = data.frame(name=rep(c('Al','Jen'),3),
                score=rep(c(100, 80, 60),2))
df

Data:

  name score
1   Al   100
2  Jen    80
3   Al    60
4  Jen   100
5   Al    80
6  Jen    60

Now I want to find the lead() and lag() scores for each person. If I sort the data first with arrange(), I get the correct answer:

df %>%
  arrange(name) %>%
  group_by(name) %>%
  mutate(next.score = lead(score),
         before.score = lag(score) )

OUTPUT1:

Source: local data frame [6 x 4]
Groups: name

  name score next.score before.score
1   Al   100         60           NA
2   Al    60         80          100
3   Al    80         NA           60
4  Jen    80        100           NA
5  Jen   100         60           80
6  Jen    60         NA          100

Without arrange(), the result is wrong:

df %>%
  group_by(name) %>%
  mutate(next.score = lead(score),
         before.score = lag(score) )

OUTPUT2:

Source: local data frame [6 x 4]
Groups: name

  name score next.score before.score
1   Al   100         80           NA
2  Jen    80         60           NA
3   Al    60        100           80
4  Jen   100         80           60
5   Al    80         NA          100
6  Jen    60         NA           80

For example, in the first row, Al's next.score should be 60 (taken from the third row), not 80.

Does anybody know why this happens? Why does arrange() affect the result (the values themselves, not just the row order)? Thanks~

hrbrmstr
YJZ
  • @DavidArenburg it's not the sorting, the OP asks why the result is 80 when in the original data frame the next result is 60. It's like Jen's result was picked instead of Al's – Panagiotis Kanavos Jan 30 '15 at 12:03
  • And I can't repro. Which version of R are you using? I get `1 Al 100 60 NA` with R 3.1.2 on Windows 7 – Panagiotis Kanavos Jan 30 '15 at 12:05
  • @PanagiotisKanavos, yeah you are right. I didn't notice that. – David Arenburg Jan 30 '15 at 12:08
  • 1
    I can reproduce the weird results (`0.4.1.9000`). I think (after a quick, groggy-eyed glance at the source of the series of function calls) it's because the underlying code is going by actual overall row-index instead of the relative row-index. That might explain `lead` (I think `pmin` is the place of the weirdness), but not sure what's going on with `lag` (didn't look there). – hrbrmstr Jan 30 '15 at 12:09
  • I had 0.3.0.2 which I installed yesterday, and I can't repro the results. Default mirror was 0-Cloud – Panagiotis Kanavos Jan 30 '15 at 12:10
  • @PanagiotisKanavos I bleeding edge the hadleyverse every couple of days from devtools/github. – hrbrmstr Jan 30 '15 at 12:12
  • 8
    This seems to be a bug in the latest version 0.4.1 of `dplyr` and was already reported [here](https://github.com/hadley/dplyr/issues/925) – alex23lemm Jan 30 '15 at 12:16
  • @hrbrmstr no, just 0 seems to be far behind any other mirror. And I get the same error now. – Panagiotis Kanavos Jan 30 '15 at 12:18
  • I see thanks! I think for now I can sort the data first using arrange() to avoid this problem. – YJZ Jan 30 '15 at 22:21

3 Answers

54

It seems you have to pass an additional argument to the lag and lead functions. When I run your code without arrange(), but with order_by added, everything seems to be OK.

df %>%
  group_by(name) %>%
  mutate(next.score = lead(score, order_by=name),
         before.score = lag(score, order_by=name))

Output:

  name score next.score before.score
1   Al   100         60           NA
2  Jen    80        100           NA
3   Al    60         80          100
4  Jen   100         60           80
5   Al    80         NA           60
6  Jen    60         NA          100
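
As a cross-check that does not depend on the installed dplyr version, the same per-group shift can be sketched in base R with ave(). The helper names shift_lead and shift_lag are made up for this illustration and are not part of any package; this is a minimal sketch, not a replacement for lead() and lag().

```r
# Same data as in the question
df <- data.frame(name  = rep(c("Al", "Jen"), 3),
                 score = rep(c(100, 80, 60), 2))

# Hypothetical helpers: shift a vector by one position, padding with NA
shift_lead <- function(x) c(x[-1], NA)          # next value
shift_lag  <- function(x) c(NA, x[-length(x)])  # previous value

# ave() applies the function within each name and keeps the original row order
df$next.score   <- ave(df$score, df$name, FUN = shift_lead)
df$before.score <- ave(df$score, df$name, FUN = shift_lag)
df
```

On the question's data this should give the same next.score and before.score columns as the output above, so any disagreement with a dplyr result points at the dplyr call rather than at the data.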

My sessionInfo():

R version 3.1.1 (2014-07-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=Polish_Poland.1250  LC_CTYPE=Polish_Poland.1250        LC_MONETARY=Polish_Poland.1250
[4] LC_NUMERIC=C                   LC_TIME=Polish_Poland.1250    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_0.4.1

loaded via a namespace (and not attached):
[1] assertthat_0.1  DBI_0.3.1       lazyeval_0.1.10 magrittr_1.5                parallel_3.1.1  Rcpp_0.11.5    
[7] tools_3.1.1 
Tomasz Sosiński
  • 4
    I am working with a similar case of lag, but with one change - multiple columns for grouping and ordering! If there were multiple columns in group_by and order_by, how different will the answer be? I tried passing vectors but that doesn't help. – Akshay Rane Mar 05 '17 at 14:25
24

It may happen that stats::lag is used instead (e.g. when restoring environments with the session package). This can easily slip through unnoticed, as it won't throw an error when used as in the question. Double-check by simply typing lag at the console, use the conflicted package, or disambiguate the call by writing dplyr::lag instead.

The same can happen with plyr::mutate if you have loaded the plyr package in your session. So make sure you're also calling dplyr::mutate.
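
To illustrate why the mix-up goes unnoticed, here is a small base R sketch of what stats::lag does to a plain numeric vector: it only adjusts a time-series attribute and leaves the values where they are. The variable names are chosen for this example only.

```r
x <- c(100, 80, 60)

# stats::lag does not shift the values; it only changes the "tsp" attribute
shifted <- stats::lag(x)
values_unchanged <- identical(as.numeric(shifted), x)
values_unchanged

# The time-series attribute is the only thing that moved
attr(shifted, "tsp")

# Which package does a bare `lag` resolve to in this session?
# With dplyr attached (and nothing masking it) this should report "dplyr".
lag_owner <- environmentName(environment(lag))
lag_owner
```

If lag_owner is not the package you expected, writing dplyr::lag explicitly avoids the ambiguity.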

Titorelli
Holger Brandl
  • 2
    many packages have a lag function - as this comment states, verifying / disambiguation is crucial unless you are simply relying upon a small set of packages or base R + dplyr alone. – HoneyBuddha Sep 08 '19 at 08:20
  • 3
    I wasted the last 2 hours trying to understand what was suddenly wrong in a code that was supposed to run, before finding this GodSend reply stressing dplyr:mutate and dplyr:lag. This needs much more attention. Thank you sir. – Bob May 24 '21 at 15:23
  • Reiterating what everyone else has said. Spent so long wondering when a very simple piece of code suddenly isn't working. Very frustrating! Your answer is so simple, yet so easy to overlook, and worked a treat!!! Thank you – DataMonkey Jul 25 '23 at 08:19
6

Using order_by works well when you have only one grouping variable. With multiple grouping variables, I could not find any solution other than writing the table out and reading it back in to get rid of the grouping variables. It worked pretty well for me, but its efficiency depends on the size of the table.
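
For the multi-variable case, one possible alternative that avoids the round trip through a file is base R's ave(), which accepts several grouping vectors. The sketch below uses made-up data (the team column and the df2 name are assumptions for illustration) and a hypothetical helper shift_lead2; it is not taken from the original answers.

```r
# Hypothetical data with two grouping variables: team and name
df2 <- data.frame(team  = rep(c("A", "B"), each = 4),
                  name  = rep(c("Al", "Jen"), 4),
                  score = c(100, 80, 60, 100, 80, 60, 100, 80))

# Hypothetical helper: next value within a group, NA at the end
shift_lead2 <- function(x) c(x[-1], NA)

# ave() groups by every combination of team and name
df2$next.score <- ave(df2$score, df2$team, df2$name, FUN = shift_lead2)
df2
```

Each (team, name) pair here has two rows, so every group gets one real next.score followed by one NA.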

Adrian
  • 4
    I created a dummy grouping variable for this case to allow using order_by: `mutate(grouping=sprintf("%04d-%04d",var1,var2)) %>% mutate(next.score = lead(score, order_by=grouping)) %>% select(-grouping)` – Quantum7 Nov 27 '17 at 14:38