2

I have a very large data set where I want to to take the lagged value of a row for the same by group. This functionality is well documented in the data.table package and I've seen other examples of it. However with my data I keep getting an error and I cannot figure out why.

According to documentation, this is the process. I can run this and it works.

library(data.table)
DT = data.table(year=rep(2010:2011, each=3), v1=1:6)
DT[, c("lag1", "lag2") := shift(.SD, 1:2), by=year]

Now here is a sample of my data.

unique_id click_time
100005361 2017-11-09 03:58:32
100005371 2017-11-09 00:53:08
100005371 2017-11-09 04:38:52
100005371 2017-11-09 05:42:30
100005371 2017-11-09 05:42:31
100005371 2017-11-09 05:42:31
100005371 2017-11-09 05:42:32
100005371 2017-11-09 05:42:38
100005371 2017-11-09 05:42:51
100005371 2017-11-09 09:46:57
100005371 2017-11-09 09:46:58
100005371 2017-11-09 09:46:58
100005371 2017-11-09 09:47:00
100005371 2017-11-09 11:37:26
100005391 2017-11-09 05:04:36
100005391 2017-11-09 05:04:41
10000541 2017-11-09 10:55:35
100005411 2017-11-08 22:55:49
100005411 2017-11-08 22:56:14

There are 5 unique ids here. I want to use the lag of click time and join it to each row where applicable. Obviously if there's just 1 row or it's the 1st row then there wont be any lag. Applying the above code should be a one liner. This what I tried.

DT[, c("last_click_time") := shift(.SD), by=unique_id]

This is following the syntax from the documentation and from links I've seen elsewhere such as here: How to create a lag variable within each group?

I get an error though that I can't figure out:

Error in `[.data.frame`(DT, , `:=`(c("last_click_time"), shift(.SD)),  : 
  unused argument (by = unique_id)

Even if I replace the timestamps to something simpler, the error persists.

DT$click_time = seq(1,2000,length.out = 19)

Any idea why this error is popping up? This is driving me mad!

EDIT: The answer to this question is I am an idiot. If you try this on a data.frame() object you will get the error I showed. If you convert to a data.table() then the code works as it should. Embarrassing but hopefully this post will be useful to someone else later on.

user137698
  • 291
  • 2
  • 12

2 Answers2

2

Your example seems to work for me, although I had to tidy the data a little to make sure it reads in correctly:

library( data.table )

DT <- fread( 'unique_id,click_time
100005361,2017-11-09 03:58:32
100005371,2017-11-09 00:53:08
100005371,2017-11-09 04:38:52
100005371,2017-11-09 05:42:30
100005371,2017-11-09 05:42:31
100005371,2017-11-09 05:42:31
100005371,2017-11-09 05:42:32
100005371,2017-11-09 05:42:38
100005371,2017-11-09 05:42:51
100005371,2017-11-09 09:46:57
100005371,2017-11-09 09:46:58
100005371,2017-11-09 09:46:58
100005371,2017-11-09 09:47:00
100005371,2017-11-09 11:37:26
100005391,2017-11-09 05:04:36
100005391,2017-11-09 05:04:41
10000541,2017-11-09 10:55:35
100005411,2017-11-08 22:55:49
100005411,2017-11-08 22:56:14' )

DT[, c("last_click_time") := shift(.SD), by=unique_id]

The result:

> head( DT )

   unique_id          click_time     last_click_time
1: 100005361 2017-11-09 03:58:32                  NA
2: 100005371 2017-11-09 00:53:08                  NA
3: 100005371 2017-11-09 04:38:52 2017-11-09 00:53:08
4: 100005371 2017-11-09 05:42:30 2017-11-09 04:38:52
5: 100005371 2017-11-09 05:42:31 2017-11-09 05:42:30
6: 100005371 2017-11-09 05:42:31 2017-11-09 05:42:31

I'm note sure whether it's best practice or not, but I personally avoid using .SD unless it's really necessary. Here you can go without. I'd use this, which should give an identical result to the above:

DT[, last_click_time := shift(click_time), by=unique_id]
rosscova
  • 5,430
  • 1
  • 22
  • 35
  • Thank you for the quick and helpful reply. I'm still getting an error which makes me think this is related other conflicting packages or my machine. I'm going to try to run the same code on a different machine and see if it works there. – user137698 May 11 '18 at 11:32
0

Your error indicates you are using data.frame rather than data.table, try it again after using as.data.table(your data)