Use a value from the previous row in an R data.table calculation

Question

I want to create a new column in a data.table calculated from the current value of one column and the previous of another. Is it possible to access previous rows?

E.g.:

> DT <- data.table(A=1:5, B=1:5*10, C=1:5*100)
> DT
   A  B   C
1: 1 10 100
2: 2 20 200
3: 3 30 300
4: 4 40 400
5: 5 50 500
> DT[, D := C + BPreviousRow] # What is the correct code here?

The correct answer should be

> DT
   A  B   C   D
1: 1 10 100  NA
2: 2 20 200 210
3: 3 30 300 320
4: 4 40 400 430
5: 5 50 500 540

I usually set a key to my data.tables: ``DT <- data.table(A=..., key = "A")`` — PatrickT, Jun 21 '16 at 08:08

Arun · Accepted Answer · 2015-10-28T13:41:16.837

115

With shift() implemented in v1.9.6, this is quite straightforward.

DT[ , D := C + shift(B, 1L, type="lag")]
# or equivalently, in this case,
DT[ , D := C + shift(B)]

From NEWS:

New function shift() implements fast lead/lag of vector, list, data.frames or data.tables. It takes a type argument which can be either "lag" (default) or "lead". It enables very convenient usage along with := or set(). For example: DT[, (cols) := shift(.SD, 1L), by=id]. Please have a look at ?shift for more info.

See history for previous answers.

edited Oct 28 '15 at 13:41

answered Feb 04 '13 at 15:02

Arun

116,683
26
284
387

Does that `.N` hold the current row number or something? Sorry to ask here, but I can't seem to find it in the help files... – SlowLearner Feb 04 '13 at 15:24
7

@SlowLearner: You might also find `.I` useful, which holds the row indices for the rows in the curren group. – Steve Lianoglou Feb 04 '13 at 16:02
7

Use seq_len(.N - 1) instead of 1:(.N-1). This avoids problems associated with 1:0. – mnel Feb 04 '13 at 19:08
1

+1 for the `.SD` example--I was trying to use a `lapply` and getting funky results. this is much simpler. – MichaelChirico Apr 26 '15 at 22:41
Where can I find an updated pdf with all this new information ? The official 1.9.4 vignettes and webminars don't include it. And the Rmd 1.9.5 vignettes are not comfortable and don't include it either. – skan Apr 30 '15 at 16:45
@Arun , your first solution says "Error in C + shift(B, 1L, type = "lag") : non-numeric argument to binary operator" in my data.table 1.9.5 – skan Apr 30 '15 at 16:51
@skan, update to the latest commit of data.table... Made changes yesterday. If using devel, better to **watch** the project on GitHub – Arun Apr 30 '15 at 17:10
Note - for me at least - using shift was much slower that using 'slicing' with -N - see timings here (for a data.table of 1,5 mio. rows) https://stackoverflow.com/questions/58372627/how-to-construct-an-edgeliste-from-a-list-of-visited-places-effectively – Andreas Oct 21 '19 at 10:30

Steven Beaupré · Answer 2 · 2015-04-27T02:48:34.263

55

Using dplyr you could do:

mutate(DT, D = lag(B) + C)

Which gives:

#   A  B   C   D
#1: 1 10 100  NA
#2: 2 20 200 210
#3: 3 30 300 320
#4: 4 40 400 430
#5: 5 50 500 540

edited Apr 27 '15 at 02:48

answered Apr 27 '15 at 01:52

Steven Beaupré

21,343
7
57
77

score 24 · Answer 3 · answered Aug 01 '14 at 16:24

Several folks have answered the specific question. See the code below for a general purpose function that I use in situations like this that may be helpful. Rather than just getting the prior row, you can go as many rows in the "past" or "future" as you'd like.

rowShift <- function(x, shiftLen = 1L) {
  r <- (1L + shiftLen):(length(x) + shiftLen)
  r[r<1] <- NA
  return(x[r])
}

# Create column D by adding column C and the value from the previous row of column B:
DT[, D := C + rowShift(B,-1)]

# Get the Old Faithul eruption length from two events ago, and three events in the future:
as.data.table(faithful)[1:5,list(eruptLengthCurrent=eruptions,
                                 eruptLengthTwoPrior=rowShift(eruptions,-2), 
                                 eruptLengthThreeFuture=rowShift(eruptions,3))]
##   eruptLengthCurrent eruptLengthTwoPrior eruptLengthThreeFuture
##1:              3.600                  NA                  2.283
##2:              1.800                  NA                  4.533
##3:              3.333               3.600                     NA
##4:              2.283               1.800                     NA
##5:              4.533               3.333                     NA

This is a brilliant answer, I'm annoyed that I've already upvoted the other answers because this is a far more general answer. In fact, I'm going to use it in my geneorama package (if you don't mind). — geneorama, Nov 03 '14 at 19:52
Sure, go for it. I was hoping to get some free time and submit it as a pull request to the `data.table` package, but alas... — dnlbrky, Nov 03 '14 at 20:33
A similar function called `shift` has been added to `data.table` as of version 1.9.5. See the [updated answer](http://stackoverflow.com/a/14689467/1344789) from @Arun. — dnlbrky, Feb 19 '15 at 18:53

Gary Weissman · Answer 4 · 2014-05-05T13:29:33.830

13

Based on @Steve Lianoglou 's comment above, why not just:

DT[, D:= C + c(NA, B[.I - 1]) ]
#    A  B   C   D
# 1: 1 10 100  NA
# 2: 2 20 200 210
# 3: 3 30 300 320
# 4: 4 40 400 430
# 5: 5 50 500 540

And avoid using seq_len or head or any other function.

edited May 05 '14 at 13:29

answered May 04 '14 at 04:25

Gary Weissman

3,557
1
18
23

2

Nice - however this would not work if you wanted to find the previous within a group. – Matthew Sep 02 '14 at 18:39
1

@Matthew you are right. If subsetting by group I would replace `.I` with `seq_len(.N)` – Gary Weissman Feb 15 '15 at 20:00

score 9 · Answer 5 · edited Feb 04 '13 at 16:07

9

Following Arun's solution, a similar results can be obtained without referring to to .N

> DT[, D := C + c(NA, head(B, -1))][]
   A  B   C   D
1: 1 10 100  NA
2: 2 20 200 210
3: 3 30 300 320
4: 4 40 400 430
5: 5 50 500 540

edited Feb 04 '13 at 16:07

Carl Witthoft

20,573
9
43
73

answered Feb 04 '13 at 15:53

Ryogi

5,497
5
26
46

Is there a reason to prefer one method to another? Or is it simply an aesthetic difference? – Corvus Feb 04 '13 at 16:10
I think that in this scenario (i.e. where `.N` is readily available) it is mostly aesthetic choice. I am not aware of any important difference. – Ryogi Feb 04 '13 at 16:24

score 1 · Answer 6 · edited Nov 06 '18 at 08:58

Here is my intuitive solution:

#create data frame
df <- data.frame(A=1:5, B=seq(10,50,10), C=seq(100,500, 100))`
#subtract the shift from num rows
shift  <- 1 #in this case the shift is 1
invshift <- nrow(df) - shift
#Now create the new column
df$D <- c(NA, head(df$B, invshift)+tail(df$C, invshift))`

Here invshift, the number of rows minus 1, is 4. nrow(df) provides you with the number of rows in a data frame or in a vector. Similarly, if you want to take still earlier values, subtract from nrow 2, 3, ...etc, and also put NA's accordingly at the beginning.

score -2 · Answer 7 · answered Jan 29 '20 at 19:52

it can be done in a loop.

# Create the column D
DT$D <- 0
# for every row in DT
for (i in 1:length(DT$A)) {
  if(i==1) {
    #using NA at first line
    DT[i,4] <- NA
  } else {
    #D = C + BPreviousRow
    DT[i,4] <- DT[i,3] + DT[(i-1), 2]   
  }
}

Using a for, you can even use the previous value of the row of this new column DT[(i-1), 4]

Use a value from the previous row in an R data.table calculation

7 Answers7

Linked

Related