I am trying to calculated the lagged difference (or actual increase) for data that has been inadvertently aggregated. Each successive year in the data includes values from the previous year. A sample data set can be created with this code:
set.seed(1234)
x <- data.frame(id=1:5, value=sample(20:30, 5, replace=T), year=3)
y <- data.frame(id=1:5, value=sample(10:19, 5, replace=T), year=2)
z <- data.frame(id=1:5, value=sample(0:9, 5, replace=T), year=1)
(df <- rbind(x, y, z))
I can use a combination of lapply()
and split()
to calculate the difference between each year for every unique id, like so:
(diffs <- lapply(split(df, df$id), function(x){-diff(x$value)}))
However, because of the nature of the diff()
function, there are no results for the values in year 1, which means that after I flatten the diffs
list of lists with Reduce()
, I cannot add the actual yearly increases back into the data frame, like so:
df$actual <- Reduce(c, diffs) # flatten the list of lists
In this example, there are only 10 calculated differences or lags, while there are 15 rows in the data frame, so R throws an error when trying to add a new column.
How can I create a new column of actual increases with (1) the values for year 1 and (2) the calculated diffs/lags for all subsequent years?
This is the output I'm eventually looking for. My diffs
list of lists calculates the actual values for years 2 and 3 just fine.
id value year actual
1 21 3 5
2 26 3 16
3 26 3 14
4 26 3 10
5 29 3 14
1 16 2 10
2 10 2 5
3 12 2 10
4 16 2 7
5 15 2 13
1 6 1 6
2 5 1 5
3 2 1 2
4 9 1 9
5 2 1 2