96

I want to create a new column in a data.table calculated from the current value of one column and the previous of another. Is it possible to access previous rows?

E.g.:

> DT <- data.table(A=1:5, B=1:5*10, C=1:5*100)
> DT
   A  B   C
1: 1 10 100
2: 2 20 200
3: 3 30 300
4: 4 40 400
5: 5 50 500
> DT[, D := C + BPreviousRow] # What is the correct code here?

The correct answer should be

> DT
   A  B   C   D
1: 1 10 100  NA
2: 2 20 200 210
3: 3 30 300 320
4: 4 40 400 430
5: 5 50 500 540
Corvus
  • 7,548
  • 9
  • 42
  • 68

7 Answers7

115

With shift() implemented in v1.9.6, this is quite straightforward.

DT[ , D := C + shift(B, 1L, type="lag")]
# or equivalently, in this case,
DT[ , D := C + shift(B)]

From NEWS:

  1. New function shift() implements fast lead/lag of vector, list, data.frames or data.tables. It takes a type argument which can be either "lag" (default) or "lead". It enables very convenient usage along with := or set(). For example: DT[, (cols) := shift(.SD, 1L), by=id]. Please have a look at ?shift for more info.

See history for previous answers.

Arun
  • 116,683
  • 26
  • 284
  • 387
  • Does that `.N` hold the current row number or something? Sorry to ask here, but I can't seem to find it in the help files... – SlowLearner Feb 04 '13 at 15:24
  • 7
    @SlowLearner: You might also find `.I` useful, which holds the row indices for the rows in the curren group. – Steve Lianoglou Feb 04 '13 at 16:02
  • 7
    Use seq_len(.N - 1) instead of 1:(.N-1). This avoids problems associated with 1:0. – mnel Feb 04 '13 at 19:08
  • 1
    +1 for the `.SD` example--I was trying to use a `lapply` and getting funky results. this is much simpler. – MichaelChirico Apr 26 '15 at 22:41
  • Where can I find an updated pdf with all this new information ? The official 1.9.4 vignettes and webminars don't include it. And the Rmd 1.9.5 vignettes are not comfortable and don't include it either. – skan Apr 30 '15 at 16:45
  • @Arun , your first solution says "Error in C + shift(B, 1L, type = "lag") : non-numeric argument to binary operator" in my data.table 1.9.5 – skan Apr 30 '15 at 16:51
  • @skan, update to the latest commit of data.table... Made changes yesterday. If using devel, better to **watch** the project on GitHub – Arun Apr 30 '15 at 17:10
  • Note - for me at least - using shift was much slower that using 'slicing' with -N - see timings here (for a data.table of 1,5 mio. rows) https://stackoverflow.com/questions/58372627/how-to-construct-an-edgeliste-from-a-list-of-visited-places-effectively – Andreas Oct 21 '19 at 10:30
55

Using dplyr you could do:

mutate(DT, D = lag(B) + C)

Which gives:

#   A  B   C   D
#1: 1 10 100  NA
#2: 2 20 200 210
#3: 3 30 300 320
#4: 4 40 400 430
#5: 5 50 500 540
Steven Beaupré
  • 21,343
  • 7
  • 57
  • 77
24

Several folks have answered the specific question. See the code below for a general purpose function that I use in situations like this that may be helpful. Rather than just getting the prior row, you can go as many rows in the "past" or "future" as you'd like.

rowShift <- function(x, shiftLen = 1L) {
  r <- (1L + shiftLen):(length(x) + shiftLen)
  r[r<1] <- NA
  return(x[r])
}

# Create column D by adding column C and the value from the previous row of column B:
DT[, D := C + rowShift(B,-1)]

# Get the Old Faithul eruption length from two events ago, and three events in the future:
as.data.table(faithful)[1:5,list(eruptLengthCurrent=eruptions,
                                 eruptLengthTwoPrior=rowShift(eruptions,-2), 
                                 eruptLengthThreeFuture=rowShift(eruptions,3))]
##   eruptLengthCurrent eruptLengthTwoPrior eruptLengthThreeFuture
##1:              3.600                  NA                  2.283
##2:              1.800                  NA                  4.533
##3:              3.333               3.600                     NA
##4:              2.283               1.800                     NA
##5:              4.533               3.333                     NA
dnlbrky
  • 9,396
  • 2
  • 51
  • 64
  • This is a brilliant answer, I'm annoyed that I've already upvoted the other answers because this is a far more general answer. In fact, I'm going to use it in my geneorama package (if you don't mind). – geneorama Nov 03 '14 at 19:52
  • Sure, go for it. I was hoping to get some free time and submit it as a pull request to the `data.table` package, but alas... – dnlbrky Nov 03 '14 at 20:33
  • A similar function called `shift` has been added to `data.table` as of version 1.9.5. See the [updated answer](http://stackoverflow.com/a/14689467/1344789) from @Arun. – dnlbrky Feb 19 '15 at 18:53
13

Based on @Steve Lianoglou 's comment above, why not just:

DT[, D:= C + c(NA, B[.I - 1]) ]
#    A  B   C   D
# 1: 1 10 100  NA
# 2: 2 20 200 210
# 3: 3 30 300 320
# 4: 4 40 400 430
# 5: 5 50 500 540

And avoid using seq_len or head or any other function.

Gary Weissman
  • 3,557
  • 1
  • 18
  • 23
9

Following Arun's solution, a similar results can be obtained without referring to to .N

> DT[, D := C + c(NA, head(B, -1))][]
   A  B   C   D
1: 1 10 100  NA
2: 2 20 200 210
3: 3 30 300 320
4: 4 40 400 430
5: 5 50 500 540
Carl Witthoft
  • 20,573
  • 9
  • 43
  • 73
Ryogi
  • 5,497
  • 5
  • 26
  • 46
  • Is there a reason to prefer one method to another? Or is it simply an aesthetic difference? – Corvus Feb 04 '13 at 16:10
  • I think that in this scenario (i.e. where `.N` is readily available) it is mostly aesthetic choice. I am not aware of any important difference. – Ryogi Feb 04 '13 at 16:24
1

Here is my intuitive solution:

#create data frame
df <- data.frame(A=1:5, B=seq(10,50,10), C=seq(100,500, 100))`
#subtract the shift from num rows
shift  <- 1 #in this case the shift is 1
invshift <- nrow(df) - shift
#Now create the new column
df$D <- c(NA, head(df$B, invshift)+tail(df$C, invshift))`

Here invshift, the number of rows minus 1, is 4. nrow(df) provides you with the number of rows in a data frame or in a vector. Similarly, if you want to take still earlier values, subtract from nrow 2, 3, ...etc, and also put NA's accordingly at the beginning.

Matthew Ciaramitaro
  • 1,184
  • 1
  • 13
  • 27
-2

it can be done in a loop.

# Create the column D
DT$D <- 0
# for every row in DT
for (i in 1:length(DT$A)) {
  if(i==1) {
    #using NA at first line
    DT[i,4] <- NA
  } else {
    #D = C + BPreviousRow
    DT[i,4] <- DT[i,3] + DT[(i-1), 2]   
  }
}

Using a for, you can even use the previous value of the row of this new column DT[(i-1), 4]

Rafael Braga
  • 177
  • 2
  • 4