Data frame "expand" procedure in R?

Question

This is not a real statistical question, but rather a data preparation question before performing the actual statistical analysis. I have a data frame which consists of sparse data. I would like to "expand" this data to include zeroes for missing values, group by group.

Here is an example of the data (a and b are two factors defining the group, t is the sparse timestamp and xis the value):

test <- data.frame(
    a=c(1,1,1,1,1,1,1,1,1,1,1),
    b=c(1,1,1,1,1,2,2,2,2,2,2),
    t=c(0,2,3,4,7,3,4,6,7,8,9),
    x=c(1,2,1,2,2,1,1,2,1,1,3))

Assuming I would like to expand the values between t=0 and t=9, this is the result I'm hoping for:

test.expanded <- data.frame(
    a=c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1),
    b=c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2),
    t=c(0,1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,9),
    x=c(1,0,2,1,2,0,0,2,0,0,0,0,0,1,1,0,2,1,1,3))

Zeroes have been inserted for all missing values of t. This makes it easier to use.

I have a quick and dirty implementation which sorts the dataframe and loops through each of its lines, adding missing lines one at a time. But I'm not entirely satisfied by the solution. Is there a better way to do it?

For those who are familiar with SAS, it is similar to the proc expand.

Thanks!

score 4 · Accepted Answer · answered Jan 31 '12 at 18:41

As you noted in a comment to the other answer, doing it by group is easy with plyr which just leaves how to "fill in" the data sets. My approach is to use merge.

library("plyr")

test.expanded <- ddply(test, c("a","b"), function(DF) {
  DF <- merge(data.frame(t=0:9), DF[,c("t","x")], all.x=TRUE)
  DF[is.na(DF$x),"x"] <- 0
  DF
})

merge with all.x=TRUE will make the missing values NA, so the second line of the function is needed to replace those NAs with 0's.

score 1 · Answer 2 · edited Jan 09 '23 at 13:15

1

This is convoluted but works fine:

test <- data.frame(
   a=c(1,1,1,1,1,1,1,1,1,1,1),
   b=c(1,1,1,1,1,2,2,2,2,2,2),
   t=c(0,2,3,4,7,3,4,6,7,8,9),
  x=c(1,2,1,2,2,1,1,2,1,1,3))
 
my.seq <- seq(0,9)
not.t <- !(my.seq %in% test$t)
test[nrow(test)+seq(length(my.seq[not.t])),"t"] <- my.seq[not.t]
test
#------------
    a  b t  x
1   1  1 0  1
2   1  1 2  2
3   1  1 3  1
4   1  1 4  2
5   1  1 7  2
6   1  2 3  1
7   1  2 4  1
8   1  2 6  2
9   1  2 7  1
10  1  2 8  1
11  1  2 9  3
12 NA NA 1 NA
13 NA NA 5 NA

Not sure if you want it sorted by t afterwards or not. If so, easy enough to do:

https://stackoverflow.com/a/6871968/636656

edited Jan 09 '23 at 13:15

IRTFM

258,963
21
364
487

answered Jan 31 '12 at 16:59

Ari B. Friedman

71,271
35
175
235

Hm. I see you want to group it by some other variable as well. Easy enough to extend this approach, but I'll wait and see if someone else comes up with the command that surely exists to do this in one step ;-) – Ari B. Friedman Jan 31 '12 at 17:02
Yes, in this example the output data frame is supposed to have 20 rows because I have two groups (a=1,b=1) (a=1,b=2). Thanks for your answer. – caas Jan 31 '12 at 17:06
Oh, turns out grouping is not really a problem thanks to package `plyr`: ddply(test, c("a","b"), expand_function), and I can define expand_function to something in the lines of what you proposed. – caas Jan 31 '12 at 17:10
plyr proves handy again :-). @BrianDiggs 's solution looks nice. – Ari B. Friedman Jan 31 '12 at 19:27

Data frame "expand" procedure in R?

2 Answers2