I am trying to calculate the lagged difference (or actual increase) for data that has been inadvertently aggregated. Each successive year in the data includes values from the previous year. A sample data set can be created with this code:

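set.seed(1234)
# Note: R 3.6.0 changed the default sampling algorithm, so on newer versions
# these calls may not reproduce the values shown below
x <- data.frame(id=1:5, value=sample(20:30, 5, replace=TRUE), year=3)
y <- data.frame(id=1:5, value=sample(10:19, 5, replace=TRUE), year=2)
z <- data.frame(id=1:5, value=sample(0:9, 5, replace=TRUE), year=1)
(df <- rbind(x, y, z))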

I can use a combination of lapply() and split() to calculate the difference between each year for every unique id, like so:

(diffs <- lapply(split(df, df$id), function(x){-diff(x$value)}))

However, because of the nature of the diff() function, there are no results for the values in year 1, which means that after I flatten the diffs list of lists with Reduce(), I cannot add the actual yearly increases back into the data frame, like so:

df$actual <- Reduce(c, diffs)  # flatten the list of lists

In this example, there are only 10 calculated differences or lags, while there are 15 rows in the data frame, so R throws an error when trying to add a new column.
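
The mismatch is easy to reproduce on a single id. A minimal illustration, using the three values for id 1 from the table further down (the name `v` is only for this example):

```r
v <- c(21, 16, 6)   # id 1, years 3, 2, 1

d <- -diff(v)       # differences between consecutive years
length(v)           # 3 rows in the data for this id
length(d)           # 2 differences: nothing is returned for year 1
```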

How can I create a new column of actual increases with (1) the values for year 1 and (2) the calculated diffs/lags for all subsequent years?

This is the output I'm eventually looking for. My diffs list of lists calculates the actual values for years 2 and 3 just fine.

id value year actual
 1    21    3      5
 2    26    3     16
 3    26    3     14
 4    26    3     10
 5    29    3     14
 1    16    2     10
 2    10    2      5
 3    12    2     10
 4    16    2      7
 5    15    2     13
 1     6    1      6
 2     5    1      5
 3     2    1      2
 4     9    1      9
 5     2    1      2
Andrew

3 Answers

I think this will work for you. When you run into the diff() length problem, just lengthen the vector by putting 0 in as the first element.

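df <- df[order(df$id, df$year), ]      # sort by id, then ascending year
sdf <- split(df, df$id)                # one data frame per id
# pad each id's values with a leading 0 so diff() keeps the year 1 value
df$actual <- as.vector(sapply(seq_along(sdf), function(x) diff(c(0, sdf[[x]][,2]))))
df[order(as.numeric(rownames(df))),]   # restore the original row order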

There are lots of ways to do this, but this one is fairly fast and uses only base R.
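
As a minimal sketch of why the zero padding works, the block below rebuilds the sample data with the values from the question's expected output (hard-coded, so it does not depend on `set.seed()`) and checks the padded `diff()` on one group. It also shows `ave()`, a base function not used elsewhere in this answer, as one possible way to skip the sort and re-sort steps. `pad_diff` is a name made up for this example, and the sketch assumes every id has rows for consecutive years.

```r
# Sample data with the values shown in the question's expected output
df <- data.frame(
  id    = rep(1:5, times = 3),
  value = c(21, 26, 26, 26, 29, 16, 10, 12, 16, 15, 6, 5, 2, 9, 2),
  year  = rep(3:1, each = 5)
)

# Cumulative values for id 1 in ascending year order: 6, 16, 21
v <- df$value[df$id == 1][order(df$year[df$id == 1])]
diff(v)        # 10 5   -> one element short, year 1 is lost
diff(c(0, v))  # 6 10 5 -> padding with 0 keeps the year 1 value

# Hypothetical helper: padded diff that returns results in the input order
pad_diff <- function(value, year) {
  o <- order(year)                 # positions sorted by ascending year
  out <- numeric(length(value))
  out[o] <- diff(c(0, value[o]))   # write each diff back to its original slot
  out
}

# ave() applies a function per group and keeps the original row order,
# but passes only one vector, so group over row indices instead
idx <- seq_len(nrow(df))
df$actual <- ave(idx, df$id, FUN = function(i) pad_diff(df$value[i], df$year[i]))
df
```

Because the helper orders by year inside each group, the rows of `df` can be in any order, which is the main difference from the sort-then-restore approach above.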

Here are a few more ways of approaching this problem, using aggregate, by, and plyr:

aggregate:

df <- df[order(df$id, df$year), ]
diff2 <- function(x) diff(c(0, x))
df$actual <- c(unlist(t(aggregate(value~id, df, diff2)[, -1])))
df[order(as.numeric(rownames(df))),]

by:

df <- df[order(df$id, df$year), ]
diff2 <- function(x) diff(c(0, x))
df$actual <- unlist(by(df$value, df$id, diff2))
df[order(as.numeric(rownames(df))),]

plyr:

df <- df[order(df$id, df$year), ]
df <- data.frame(temp=1:nrow(df), df)
library(plyr)
df <- ddply(df, .(id), transform, actual=diff2(value))
df[order(-df$year, df$temp),][, -1]

It gives you the final product of:

> df[order(as.numeric(rownames(df))),]
   id value year actual
1   1    21    3      5
2   2    26    3     16
3   3    26    3     14
4   4    26    3     10
5   5    29    3     14
6   1    16    2     10
7   2    10    2      5
8   3    12    2     10
9   4    16    2      7
10  5    15    2     13
11  1     6    1      6
12  2     5    1      5
13  3     2    1      2
14  4     9    1      9
15  5     2    1      2

EDIT: Avoiding the Loop

May I suggest avoiding the loop: turn what I gave you into a function (the by solution is the easiest one for me to work with) and sapply it over the two columns you want.

set.seed(1234)  #make new data with another numeric column
x <- data.frame(id=1:5, value=sample(20:30, 5, replace=TRUE), year=3)
y <- data.frame(id=1:5, value=sample(10:19, 5, replace=TRUE), year=2)
z <- data.frame(id=1:5, value=sample(0:9, 5, replace=TRUE), year=1)
df <- rbind(x, y, z)
df <- df.rep <- data.frame(df[, 1:2],
          new.var=df[, 2]+sample(1:5, nrow(df), replace=TRUE), year=df[, 3])


df <- df[order(df$id, df$year), ]
diff2 <- function(x) diff(c(0, x))                   #function one
group.diff<- function(x) unlist(by(x, df$id, diff2)) #answer turned function
df <- data.frame(df, sapply(df[, 2:3], group.diff))  #apply group.diff to col 2:3
df[order(as.numeric(rownames(df))),]                 #reorder it

Of course, you'd have to rename the new columns unless you used transform, as in:

df <- df[order(df$id, df$year), ]
diff2 <- function(x) diff(c(0, x))                   #function one
group.diff<- function(x) unlist(by(x, df$id, diff2)) #answer turned function
df <- transform(df, actual=group.diff(value), actual.new=group.diff(new.var))   
df[order(as.numeric(rownames(df))),]

Which approach is more convenient depends on how many variables you are doing this to.
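
For an arbitrary number of columns, one possible base R pattern is to `lapply()` a grouped, padded diff over the column names. This is only a sketch: the `value1` column and the `cols` vector are made-up example data, it uses `ave()` instead of `by()`, and it assumes the rows are already sorted by id and then year, as in the code above.

```r
# Example data, already sorted by id and then ascending year;
# value and value1 stand in for two cumulatively aggregated variables
df <- data.frame(
  id     = rep(1:2, each = 3),
  year   = rep(1:3, times = 2),
  value  = c(6, 16, 21, 5, 10, 26),
  value1 = c(1, 4, 9, 2, 2, 7)
)

diff2 <- function(x) diff(c(0, x))   # padded diff, as defined above

# Columns to de-aggregate (hypothetical names for this example)
cols <- c("value", "value1")

# ave() applies diff2 within each id and keeps the row order;
# the result is one new column per input column, prefixed with "actual."
df[paste0("actual.", cols)] <- lapply(df[cols], function(x) ave(x, df$id, FUN = diff2))
df
```

Adding another variable then only means adding its name to `cols`.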

Tyler Rinker
  • Weird. My `set.seed()` didn't work apparently. I've updated it with the real numbers. – Andrew Mar 04 '12 at 06:52
  • This is fantastic! Is there an easy way to make any of these functions work for an arbitrary number of columns, like if there were 2+ misaggregated variables: `x <- data.frame(id=1:5, value=sample(20:30, 5, replace=T), value1=sample(20:30, 5, replace=T), year=3)`, etc.? – Andrew Mar 04 '12 at 17:19
  • I mean, what I'm trying to do is run the same `as.vector(sapply(seq_along(...` function for multiple columns at a time (not just 2). I'm essentially trying to run that function with an `lapply` on the range of columns (2:x). – Andrew Mar 04 '12 at 17:30
  • If you wanted to do this to multiple columns the plyr solution is likely the easiest approach. Simply add it to the ddply as in `df <- ddply(df, .(id), transform, actual=diff2(value), NEW = diff2(EXTRA_COLUMN))` – Tyler Rinker Mar 04 '12 at 17:37
  • To stick with base, I was incredibly lazy and used a loop: `for (i in 2:3) { df[,i] <- as.vector(sapply(seq_along(sdf), function(x) diff(c(0, sdf[[x]][,i])))) }`. – Andrew Mar 04 '12 at 22:07
  • @Andrew because R is my first language my mind doesn't go to loops readily. It's actually a last resort because they're not intuitive to me. So a loop actually is not as easy for me. If you want to keep it in base you could turn what I gave you above into a function and `sapply` or `transform` with that to generate the dataframe you want. See my edit for an example. – Tyler Rinker Mar 04 '12 at 23:12
  • Awesome—thanks for the edits! I come from Perl, Python, and PHP, so I normally live with loops, and it's been hard to stop myself from using them in R. – Andrew Mar 04 '12 at 23:55

1) diff.zoo. With the zoo package it's just a matter of converting the data to zoo using split= and then performing the diff:

library(zoo)

zz <- zz0 <- read.zoo(df, split = "id", index = "year", FUN = identity)
zz[2:3, ] <- diff(zz)

It gives the following (in wide form rather than the long form you mentioned) where each column is an id and each row is a year minus the prior year:

> zz
   1  2  3  4  5
1  6  5  2  9  2
2 10  5 10  7 13
3  5 16 14 10 14

The wide form shown may actually be preferable, but if you want the long form you can convert it like this:

dt <- function(x) as.data.frame.table(t(x))
# zz0 holds the original values, zz the differences
setNames(cbind(dt(zz0), dt(zz)[3]), c("id", "year", "value", "actual"))

This puts the years in ascending order which is the convention normally used in R.

2) rollapply. Also using zoo, this alternative uses a rolling calculation to add the actual column to your data. It assumes the data is structured as you show, with the same number of years in each group arranged in order (the width of 6 is the number of ids plus one):

df$actual <- rollapply(df$value, 6, partial = TRUE, align = "left",
   FUN = function(x) if (length(x) < 6) x[1] else x[1]-x[6])

3) subtraction. Making the same assumptions as in the prior solution, we can simplify further to just this, which subtracts from each value the value 5 positions later (the same id in the previous year):

transform(df, actual = value - c(tail(value, -5), rep(0, 5)))

or this variation:

transform(df, actual = replace(value, year > 1, -diff(ts(value), 5)))
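
Both one-liners hard-code 5, the number of ids. A hedged sketch of the same shift-and-subtract idea with that number computed from the data is shown below. `n_id` and `previous` are names introduced here, the data frame is rebuilt from the values in the question, and the same layout assumptions apply (years in descending blocks, ids in the same order within each block).

```r
# Sample data with the values shown in the question's expected output
df <- data.frame(
  id    = rep(1:5, times = 3),
  value = c(21, 26, 26, 26, 29, 16, 10, 12, 16, 15, 6, 5, 2, 9, 2),
  year  = rep(3:1, each = 5)
)

n_id <- length(unique(df$id))   # 5 in the sample data

# The previous year's value sits n_id rows further down;
# the last block (year 1) has no predecessor, so subtract 0 there
previous <- c(tail(df$value, -n_id), rep(0, n_id))
df$actual <- df$value - previous
df
```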

EDIT: added rollapply and subtraction solutions.

G. Grothendieck

Kind of hackish, but keeping your wonderful Reduce in place, you could add mock rows to your df for year 0:

mockRows <- data.frame(id = 1:5, value = 0, year = 0)
(df <- rbind(df, mockRows))
(df <- df[order(df$id, df$year), ])

(diffs <- lapply(split(df, df$id), function(x){diff(x$value)}))
(df <- df[df$year != 0,])

(df$actual <- Reduce(c, diffs)) # flatten the list of lists
df[order(as.numeric(rownames(df))),]

This is the output:

   id value year actual
1   1    21    3      5
2   2    26    3     16
3   3    26    3     14
4   4    26    3     10
5   5    29    3     14
6   1    16    2     10
7   2    10    2      5
8   3    12    2     10
9   4    16    2      7
10  5    15    2     13
11  1     6    1      6
12  2     5    1      5
13  3     2    1      2
14  4     9    1      9
15  5     2    1      2
mbask