1

First question on here, so hopefully I've done this correctly!

I have a large dataset, the following is a small sample:

id <- c(1, 1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 6)
year <- c("2010", "2011", "2012", "2014", "2012", "2013", "2011", "2012", "2013", "2010", "2011", "2012", "2013", "2008", "2009", "2011")
value <- c(100, 33, 80, 90, 80, 100, 100, 90, 80, 90, 80, 100, 100, 90, 80, 99)

df <- data.frame(id, year, value)
df

For each id I want to return the values of two successive years so that I can compare the value in year n to year n+1. Where there are not two successive years then don't return anything for that id.

The output should be as follows:

id <- c(1, 1, 2, 3, 3, 4, 4, 4, 5)
year <- c("2010", "2011", "2012", "2011", "2012", "2010", "2011", "2012", "2008")
yvalue <- c(100, 33, 80, 100, 90, 90, 80, 100, 90)
yearadd1 <- c("2011", "2012", "2013", "2012", "2013", "2011", "2012", "2013", "2009")
valueadd1 <- c(33, 80, 100, 90, 80, 80, 100, 100, 80)

df <- data.frame(id, year, yvalue, yearadd1, valueadd1)
df

How do I get r to give me this output? The main difficulty I face is that for id = 1 the first pair of successive years are 2010 and 2011, whereas for id = 4 they are 2008 and 2009, so I can't define what the first year is as it varies by id.

cookie
  • 15
  • 3

1 Answers1

0

We can use data.table. Convert the 'data.frame' to 'data.table' (setDT(df)), grouped by 'id', we loop through the columns 'year', 'value' and get the lead observation using shift, assign (:=) it to new columns and remove the NA rows (na.omit). Then get the row ids (.I) where the successive elements in 'yearadd1' is equal to 1, and extract those rows.

library(data.table)
nm1 <- names(df)[2:3]
dt <- na.omit(setDT(df)[,  paste0(nm1, "add1") :=  lapply(.SD, shift, type = "lead"),
       by = id, .SDcols = nm1])
dt[dt[, .I[c(TRUE, diff(as.numeric(as.character(yearadd1)))==1)], id]$V1]
#   id year value yearadd1 valueadd1
#1:  1 2010   100     2011        33
#2:  1 2011    33     2012        80
#3:  2 2012    80     2013       100
#4:  3 2011   100     2012        90
#5:  3 2012    90     2013        80
#6:  4 2010    90     2011        80
#7:  4 2011    80     2012       100
#8:  4 2012   100     2013       100
#9:  5 2008    90     2009        80
akrun
  • 874,273
  • 37
  • 540
  • 662
  • My data's source is actually in a spreadsheet, so I don't need to convert it to a data.table (I presented it in a data.frame so you could see the data). So with that in mind I assume I simply don't need the "SetDT" piece of that code? – cookie Jul 30 '16 at 20:05
  • @cookie I converted to `data.table` to use the data.table methods which would be fast. Also, if you still need a data.frame with the output `out <- dt[dt[, .I[c(TRUE, .....; setDF(out)` converts to data.frame. – akrun Jul 30 '16 at 20:07
  • Let's say I have another column in my dataset called "class". How would you amend your code so that the final output only shows the rows where the "class" is the same in both years? So for id = 5 if 2008 had class = a and 2009's class = b then don't show that row? – cookie Jul 30 '16 at 21:22
  • @cookie Can you please post that as a new question as it is not that clear – akrun Jul 31 '16 at 03:28
  • You did a check to ensure that only rows with successive years were extracted. I just want to introduce another criteria where the rows are only extracted where "class" is also the same in the two successive years – cookie Jul 31 '16 at 07:25
  • @cookie In the data you showed, there is no `class` variable. – akrun Jul 31 '16 at 07:29
  • I posted a new question: http://stackoverflow.com/questions/38682488/return-values-for-successive-years-where-criteria-is-met – cookie Jul 31 '16 at 14:25