8

I am looking for a function in R, similar to the lag1, lag2 and retain functions in SAS, that I can use with data.tables.

I know there are functions like embed and lag in R, but they don't return a single value or just the previous value; they return a complete set of vectors.

Is there anything in R which I can use with data.table?

More info on the SAS functions:

    Most of us R users either never used SAS or burned our SAS manuals to keep warm many years ago. What do these functions do in SAS? – Spacedman Dec 17 '13 at 14:38
  • This previous post might help you: http://stackoverflow.com/questions/3558988/basic-lag-in-r-vector-dataframe – ESmyth5988 Dec 17 '13 at 14:43
  • Joris, I'm not going to get into an edit war, but I made those edits for a reason. This isn't a SAS question, so the tag is not needed; and the question can and should easily stand alone as a question about how R works, without any reference to SAS beyond referring to the lag function as a helper to those who do know both languages (and a guide to future users who are thinking the same thing about lag). An answerer of the question should not need to read SAS documentation to answer this question. – Joe Dec 17 '13 at 16:31
  • @Joe The OP's problem is directly related to his SAS experience and the differences between SAS and R. His question only makes sense when you know how the DATA step in SAS works, and how this differs from the R approach. The main problem is that his question doesn't make sense in an R context; there is no such thing as a "previous record" in R like what is returned by the `LAG` function in SAS. Hence my (blunt) reversal of your edit. Sorry if I came across rude, that was not my intention. I'm socially not that strong. Luckily my computer can handle that ;-) – Joris Meys Dec 17 '13 at 16:53
  • I understand there's no concept of it, and my edit probably wasn't very clear (SAS user, not R user, here). But the question is supposed to be written to stand on its own as a [tag:r] question; it shouldn't _require_ knowledge of SAS to answer. Also, as it's not a question about SAS (it happens to refer to SAS, but SAS is not what the question is about) it should not be tagged [tag:SAS]. I will remove the tag and let you all figure out whether you want to reword the question or leave it alone - but please leave the [tag:sas] off, it really shouldn't be there as a tag. – Joe Dec 17 '13 at 17:03
  • I have great sympathy for the SAS user coming to R who is trying to replicate retain. I searched for and found a `lag` function in my early hours of R use and got completely sidetracked trying to use the screwed-up R time-series objects. NewRbies should instead learn to use the 'zoo' package. – IRTFM Dec 17 '13 at 18:05
  • @Joe the use of the SAS tag is perfectly justified in order to maximise viewing by people who follow either or both the R and SAS tags, possibly independently. Someone who views SAS questions may also have the knowledge of R to answer this. I don't think we need to get that pedantic about a single tag! – Simon O'Hanlon Dec 18 '13 at 12:52
  • @Joe: where in the FAQ, tour or help center of this site is it stated that a question should stand as an [XYZ] question on its own? It's a question from a SAS user who wants SAS concepts translated to R. As a SAS AND R user (yes, they do exist), that's how I answered it as well. I don't understand why you insist on deleting the SAS tag. – Joris Meys Dec 18 '13 at 13:49
  • The point of the tag is the question and the people who might *ask* it or search for it later, not the people who might *answer* it. This question would not be of help to a SAS user in any way (unless they also use R), so it's not appropriate for [tag:sas]. Feel free to open a thread on [meta] if you want to discuss this further, but this is the standard consistently agreed on in various meta threads, and I popped into Tavern on the Meta (chat) to ask some of the mods there what they thought, and they agreed - no [tag:sas] tag. – Joe Dec 18 '13 at 16:10
  • @JorisMeys, specifically regarding standing on its own: You're not entirely wrong, in that it's not bad to reference the source of the question (lag from SAS, how to do in R). And perhaps there are enough people just doing that, as DWin notes, that the question should simply be 'how to do SAS's Lag in R'. However, even better is a question that explains exactly what the user wants to do without solely referring to lag, so that when a future R user searches for it, they understand it as useful. Note the first comment on the question from Spacedman, after all. – Joe Dec 18 '13 at 16:14
  • @Joe http://r4stats.com/books/r4sas-spss/ Sometimes all a SAS user needs is a translation. – Joris Meys Dec 18 '13 at 16:23

4 Answers

9

You have to be aware that R works very differently from the DATA step in SAS. The lag function in SAS is used within the implicit loop structure of the DATA step. The same goes for retain, which simply keeps a value constant while looping through the data.

R, on the other hand, is completely vectorized. This means you have to rethink what you want to do and adapt your approach accordingly.

  • retain is simply unnecessary in R, as R recycles arguments by default. If you want to do this explicitly, you can use e.g. rep() to construct a vector of constant values with a given length (see the short sketch after this list).
  • lag is a matter of using indices: you shift the position of all values in a vector. To keep the vector the same length, you add some NAs at one end and drop the same number of values at the other.
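
A minimal sketch of both ideas on a plain vector (the variable names are purely illustrative):

x <- 1:6

# "retain": a constant value repeated for every observation
year <- rep(2013, length(x))

# "lag": shift all values one position by padding with NA and dropping the last value
y <- c(NA, x[-length(x)])   # same idea as SAS lag1(x)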

A simple example: this SAS code creates lag1 and lag2 versions of a variable x and adds a variable year that holds a constant value:

data one;
   retain year 2013;
   input x @@;
   y=lag1(x);
   z=lag2(x);
   datalines;
1 2 3 4 5 6
;

In R, you could write your own lag function like this:

mylag <- function(x,k) c(rep(NA,k),head(x,-k))

This single line adds k NAs at the beginning of the vector and drops the last k values. The result is a lagged vector, as produced by lag1 etc. in SAS.

This allows something like:

nrs <- 1:6 # equivalent to datalines
one <- data.frame(
   x = nrs,
   y = mylag(nrs,1),
   z = mylag(nrs,2),
   year = 2013  # R recycles this single value, so no retain is needed
)

The result is:

> one
  x  y  z year
1 1 NA NA 2013
2 2  1 NA 2013
3 3  2  1 2013
4 4  3  2 2013
5 5  4  3 2013
6 6  5  4 2013

Exactly the same works with a data.table object; a data.table version of the example is sketched below. The important point is to rethink your strategy: instead of thinking row by row, as you do with the DATA step in SAS, you have to start thinking in terms of vectors and indices when using R.
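
For completeness, a minimal sketch of the same example built as a data.table, reusing the mylag() helper defined above (column names as in the data.frame version):

library(data.table)

mylag <- function(x, k) c(rep(NA, k), head(x, -k))

one_dt <- data.table(x = 1:6)
one_dt[, `:=`(y = mylag(x, 1), z = mylag(x, 2), year = 2013)]
one_dt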

Joris Meys
5

I would say the closest equivalent to retain, lag1, and lag2 is the Lag function in the quantmod package.

It's very easy to use with data.tables. E.g.:

library(data.table)
library(quantmod)

d <- data.table(v1 = c(rep('a', 10), rep('b', 10)), v2 = 1:20)
setkeyv(d, 'v1')                                 # key on the grouping column

d[, new_var  := Lag(v2, 1),      by = 'v1']      # lag v2 by 1 within each v1 group
d[, new_var2 := v2 - Lag(v2, 3), by = 'v1']      # difference from the value 3 rows back
d[, new_var3 := Next(v2, 2),     by = 'v1']      # lead: the value 2 rows ahead

This yields the following:

print(d)
    v1 v2 new_var new_var2 new_var3
 1:  a  1      NA       NA        3
 2:  a  2       1       NA        4
 3:  a  3       2       NA        5
 4:  a  4       3        3        6
 5:  a  5       4        3        7
 6:  a  6       5        3        8
 7:  a  7       6        3        9
 8:  a  8       7        3       10
 9:  a  9       8        3       NA
10:  a 10       9        3       NA
11:  b 11      NA       NA       13
12:  b 12      11       NA       14
13:  b 13      12       NA       15
14:  b 14      13        3       16
15:  b 15      14        3       17
16:  b 16      15        3       18
17:  b 17      16        3       19
18:  b 18      17        3       20
19:  b 19      18        3       NA
20:  b 20      19        3       NA

As you can see, Lag lets you look back and Next lets you look forward. Both functions are nice because they pad the result with NAs such that it has the same length as the input.
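
As a quick sanity check of that padding (a small sketch, assuming quantmod's Lag/Next defaults on a plain numeric vector):

library(quantmod)

x <- 1:5
Lag(x, 1)    # one leading NA, result has the same length as x
Next(x, 1)   # one trailing NA, result has the same length as x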

If you want to get even fancier, and higher performance, you can look into rolling joins with data.table objects. This is a little bit different than what you are asking for, but it is conceptually related, and so powerful and awesome that I have to share.

Start with a data.table:

library(data.table)
library(quantmod)
set.seed(42)
d1 <- data.table(
    id=c(rep('a', 10), rep('b', 10)), 
    time=rep(1:10,2), 
    value=runif(20))
setkeyv(d1, c('id', 'time'))
print(d1)

    id time     value
 1:  a    1 0.9148060
 2:  a    2 0.9370754
 3:  a    3 0.2861395
 4:  a    4 0.8304476
 5:  a    5 0.6417455
 6:  a    6 0.5190959
 7:  a    7 0.7365883
 8:  a    8 0.1346666
 9:  a    9 0.6569923
10:  a   10 0.7050648
11:  b    1 0.4577418
12:  b    2 0.7191123
13:  b    3 0.9346722
14:  b    4 0.2554288
15:  b    5 0.4622928
16:  b    6 0.9400145
17:  b    7 0.9782264
18:  b    8 0.1174874
19:  b    9 0.4749971
20:  b   10 0.5603327

You have another data.table you want to join, but not all time indexes are present in the second table:

d2 <- data.table(
        id=sample(c('a', 'b'), 5, replace=TRUE), 
        time=sample(1:10, 5), 
        value2=runif(5))
setkeyv(d2, c('id', 'time'))
print(d2)
   id time      value2
1:  a    4 0.811055141
2:  a   10 0.003948339
3:  b    6 0.737595618
4:  b    8 0.388108283
5:  b    9 0.685169729

A regular merge yields lots of missing values:

d2[d1,,roll=FALSE]
    id time      value2     value
 1:  a    1          NA 0.9148060
 2:  a    2          NA 0.9370754
 3:  a    3          NA 0.2861395
 4:  a    4 0.811055141 0.8304476
 5:  a    5          NA 0.6417455
 6:  a    6          NA 0.5190959
 7:  a    7          NA 0.7365883
 8:  a    8          NA 0.1346666
 9:  a    9          NA 0.6569923
10:  a   10 0.003948339 0.7050648
11:  b    1          NA 0.4577418
12:  b    2          NA 0.7191123
13:  b    3          NA 0.9346722
14:  b    4          NA 0.2554288
15:  b    5          NA 0.4622928
16:  b    6 0.737595618 0.9400145
17:  b    7          NA 0.9782264
18:  b    8 0.388108283 0.1174874
19:  b    9 0.685169729 0.4749971
20:  b   10          NA 0.5603327

However, data.table allows you to roll the secondary index forward, WITHIN THE PRIMARY INDEX!

d2[d1,,roll=TRUE]
    id time      value2     value
 1:  a    1          NA 0.9148060
 2:  a    2          NA 0.9370754
 3:  a    3          NA 0.2861395
 4:  a    4 0.811055141 0.8304476
 5:  a    5 0.811055141 0.6417455
 6:  a    6 0.811055141 0.5190959
 7:  a    7 0.811055141 0.7365883
 8:  a    8 0.811055141 0.1346666
 9:  a    9 0.811055141 0.6569923
10:  a   10 0.003948339 0.7050648
11:  b    1          NA 0.4577418
12:  b    2          NA 0.7191123
13:  b    3          NA 0.9346722
14:  b    4          NA 0.2554288
15:  b    5          NA 0.4622928
16:  b    6 0.737595618 0.9400145
17:  b    7 0.737595618 0.9782264
18:  b    8 0.388108283 0.1174874
19:  b    9 0.685169729 0.4749971
20:  b   10 0.685169729 0.5603327

This is pretty damn cool: old observations are rolled forward in time until they are replaced by new ones. If you want to replace the NA values at the beginning of the series, you can do so by rolling the first observation backwards:

d2[d1,,roll=TRUE, rollends=c(TRUE, TRUE)]
    id time      value2     value
 1:  a    1 0.811055141 0.9148060
 2:  a    2 0.811055141 0.9370754
 3:  a    3 0.811055141 0.2861395
 4:  a    4 0.811055141 0.8304476
 5:  a    5 0.811055141 0.6417455
 6:  a    6 0.811055141 0.5190959
 7:  a    7 0.811055141 0.7365883
 8:  a    8 0.811055141 0.1346666
 9:  a    9 0.811055141 0.6569923
10:  a   10 0.003948339 0.7050648
11:  b    1 0.737595618 0.4577418
12:  b    2 0.737595618 0.7191123
13:  b    3 0.737595618 0.9346722
14:  b    4 0.737595618 0.2554288
15:  b    5 0.737595618 0.4622928
16:  b    6 0.737595618 0.9400145
17:  b    7 0.737595618 0.9782264
18:  b    8 0.388108283 0.1174874
19:  b    9 0.685169729 0.4749971
20:  b   10 0.685169729 0.5603327

These rolling joins are absolutely incredible, and I've never seen them implemented in any other open source package (see ?data.table for more info). It will take a little while to turn off your "SAS brain" and turn on your "R brain", but once you get over that initial hump you'll find that the language is much more expressive.

Zach
0

For retain, try this:

retain <- function(x, event, outside = NA) {
  # carry the value of x forward from each row where event is TRUE,
  # filling rows before the first event with `outside`
  indices <- c(1, which(event), length(x) + 1)
  values  <- c(outside, x[event])
  rep(values, diff(indices))
}

With the following data, I want to carry the value forward (retain) from each row where w == "b":

df <- data.frame(w = c("a","b","c","a","b","c"),
                 x = 1:6,
                 y = c(1,1,2,2,2,3),
                 stringsAsFactors = FALSE)
df$z <- retain(df$x - df$y, df$w == "b")
df
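
For reference, with the retain() above the result should look roughly like this (worked out by hand from the inputs, so treat it as an expectation rather than captured output):

  w x y  z
1 a 1 1 NA
2 b 2 1  1
3 c 3 2  1
4 a 4 2  1
5 b 5 2  3
6 c 6 3  3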

And here is the opposite, obtain, which does not exist in SAS:

obtain <- function(x, event, outside = NA) {
  # carry the value of x backwards from each row where event is TRUE,
  # filling rows after the last event with `outside`
  indices <- c(0, which(event), length(x))
  values  <- c(x[event], outside)
  rep(values, diff(indices))
}

Here's an example: I want to obtain, ahead of time, the value from the next row where w == "b":

df$z2 <- obtain(df$x - df$y, df$w == "b")
df
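
Again for reference, the z2 column should come out as follows (worked out by hand from the same inputs):

  w x y  z z2
1 a 1 1 NA  1
2 b 2 1  1  1
3 c 3 2  1  3
4 a 4 2  1  3
5 b 5 2  3  3
6 c 6 3  3 NA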

Thanks to Julien for helping.

0

Here's an example: computing a cumulative (running) sum with sqldf:
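
The original snippet doesn't show how w_cum was created; here is a minimal setup reconstructed from the output below (the column values are inferred, so treat this as an assumption):

library(sqldf)

# hypothetical input data, inferred from the id/SomeNumt columns in the output
w_cum <- data.frame(id = 1:10, SomeNumt = 11:20)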

w_cum <- sqldf("select t1.id, t1.SomeNumt, sum(t2.SomeNumt) as cum_sum
                from w_cum t1
                inner join w_cum t2 on t1.id >= t2.id
                group by t1.id, t1.SomeNumt
                order by t1.id")
w_cum

   id SomeNumt cum_sum
1   1       11      11
2   2       12      23
3   3       13      36
4   4       14      50
5   5       15      65
6   6       16      81
7   7       17      98
8   8       18     116
9   9       19     135
10 10       20     155