0

I have a question about creating lag variables depending on a time factor.

Basically I am working with a baseball dataset where there are lots of names for each player between 2002-2012. Obviously I only want lag variables for the same person to try and create a career arc to predict the current stat. Like for example I want to use lag 1 Average (2003) , lag 2 Average (2004) to try and predict the current average in 2005. So I tried to write a loop that goes through every row (the data frame is already sorted by name and then year, so the previous year is n-1 row), check if the name is the same, and if so then grab the value from the previous row.

Here is my loop:

i=2 #as 1 errors out with 1-0 row
for(i in 2:6264){
if(TS$name[i]==TS$name[i-1]){
TS$runvalueL1[i]=TS$Run_Value[i-1]
}else{
TS$runvalueL1 <- NA
}
i=i+1
}

Because each row is dependent on the name I cannot use most of the lag functions. If you have a better idea I am all ears!

Sample Data won't help a bunch but here is some:

edit: Sample data wasn't producing useable results so I just attached the first 10 people of my dataset. Thanks!

TS[(6:10),c('name','Season','Run_Value')]
               name     Season    ARuns
321           Abad Andy   2003     -1.05
3158 Abercrombie Reggie   2006     27.42
1312 Abercrombie Reggie   2007      7.65
1069 Abercrombie Reggie   2008      5.34
4614    Abernathy Brent   2002     46.71
707     Abernathy Brent   2003     -2.29
1297    Abernathy Brent   2005      5.59
6024        Abreu Bobby   2002    102.89
6087        Abreu Bobby   2003    113.23
6177        Abreu Bobby   2004    128.60

Thank you!

BaseballR
  • 147
  • 2
  • 12

3 Answers3

3

Smth along these lines should do it:

names = c("Adams","Adams","Adams","Adams","Bobby","Bobby", "Charlie")
years = c(2002,2003,2004,2005,2004,2005,2010)
Run_value = c(10,15,15,20,10,5,5)

library(data.table)
dt = data.table(names, years, Run_value)

dt[, lag1 := c(NA, Run_value), by = names]
#     names years Run_value lag1
#1:   Adams  2002        10   NA
#2:   Adams  2003        15   10
#3:   Adams  2004        15   15
#4:   Adams  2005        20   15
#5:   Bobby  2004        10   NA
#6:   Bobby  2005         5   10
#7: Charlie  2010         5   NA
eddi
  • 49,088
  • 6
  • 104
  • 155
  • I like the idea but it will not work with the actual data I am using, so I attached first 10 entries of actual data with the columns used.(Going to create lag variables for many columns but that's easy to change once one works) – BaseballR May 22 '13 at 21:51
  • I am just getting errors. Can I just use large dataset instead of dt as it is formatted well so: TS[, lag1 := c(NA, TS$Run_Value), by = TS$name] and I get this error: [57] ERROR: =`(lag1, c(NA, TS$Run_Value)), by = TS$name) – BaseballR May 22 '13 at 22:00
  • @BaseballR you need a `data.table` to run the above code, so convert your `TS` first, e.g. by running `dt = data.table(TS)` – eddi May 22 '13 at 22:03
  • You are awesome! I feel like I tried that but alas I guess not! Next question, if I wanted to get the second lag (-2 years) how would I go about that? – BaseballR May 22 '13 at 22:12
  • Also, how do I access all of the data? After I do that I would like to use function dt[(1:10),c('name','Season','Run_Value','lag1')] to see the first 10 rows of data but it does not let me. How do I take it out of a dataframe or manipulate it while it is in there? – BaseballR May 22 '13 at 22:19
  • for more lags, do e.g. `dt[, lag2 := c(NA,NA,Run_value), by = names]`; you might want to take a look at http://datatable.r-forge.r-project.org/datatable-faq.pdf and http://datatable.r-forge.r-project.org/datatable-intro.pdf to get acquainted with the package - it's a small time investment with very large pay-offs :); to answer your question, to see the first 10 rows of your data type `dt[1:10]` - also, try just typing `dt` - you'll be pleasantly surprised – eddi May 22 '13 at 22:25
  • Thank you again, one last question as I read this document (apologies if it is in the document) is there a way to see only certain columns like the dt[(1:10),c('name','Season','Run_Value','lag1')] type thing would do. It's just that I have over 90 columns with a bunch of variables and just annoying to do dt[1:10] and have to comb through to see what I want! You guys are the best! – BaseballR May 22 '13 at 22:37
  • sure, `dt[1:10, list(name, Season, Run_value, lag1)]` – eddi May 22 '13 at 22:39
0

An alternative would be to split the data by name, use lapply with the lag function of your choice and then combine the splitted data again:

TS$runvalueL1 <- do.call("rbind", lapply(split(TS, list(TS$name)), your_lag_function))

or

TS$runvalueL1 <- do.call("c", lapply(split(TS, list(TS$name)), your_lag_function))

But I guess there is also a nice possibility with plyr, but as you did not provide a reproducible example, that is all for the beginning.

Better:

TS$runvalueL1 <- unlist(lapply(split(TS, list(TS$name)), your_lag_function))
Community
  • 1
  • 1
Henrik
  • 14,202
  • 10
  • 68
  • 91
0

This is obviously not a problem where you want to create a matrix with cbind, so this is a better data structure:

full=data.frame(names, years, Run_value)

The ave function is quite useful for constructing new columns within categories of other columns:

full$Lag1 <- ave(full$Run_value, full$names, 
          FUN= function(x) c(NA, x[-length(x)] )  )
full
    names years Run_value Lag1
1   Adams  2002        10   NA
2   Adams  2003        15   10
3   Adams  2004        15   15
4   Adams  2005        20   15
5   Bobby  2004        10   NA
6   Bobby  2005         5   10
7 Charlie  2010         5   NA

I thinks it's safer to cionstruct with NA, since that will help prevent errors in logic that using 0 for prior years in year 1 would not alert you to.

IRTFM
  • 258,963
  • 21
  • 364
  • 487
  • I like the idea but it will not work with the actual data I am using, so I attached first 10 entries of actual data with the columns used.(Going to create lag variables for many columns but that's easy to change once one works) – BaseballR May 22 '13 at 21:51