Create missing observations in panel data

Question

I am working on panel data with a unique case identifier and a column for the time points of the observations (long format). There are both time-constant variables and time-varying observations:

    id  time    tc1     obs1
1   101 1       male    4
2   101 2       male    5
3   101 3       male    3
4   102 1       female  6
5   102 3       female  2
6   103 1       male    2

For my model I now need data with complete records per id for each time point. In other words, if an observation is missing I still need to put in a row with id, time, time-constant variables, and NA for the observed variables (as would be the line (102, 2, "female", NA) in the above example). So my question is:

How can I find out if a row with unique combination of id and time already exists in my dataset?
If not, how can I add this row, carry over time-constant variables and fill the observations with NA?

Would be great if someone could shed some light on this.

Thanks a lot in advance!

EDIT

Thank you everyone for your replies. Here is what I finally did, which is mix of several suggested approaches. The thing is that I have several time-varying variables (obs1-obsn) per row and I did not get dcast to accomodate for that - value.name does not take more than argument.

# create all possible permutations of id and year
iddat = expand.grid(id = unique(dataset$id), time = (c(1996,1999,2002,2005,2008,2011)))
iddat <- iddat[order(iddat$id, iddat$time), ]

# add permutations to existing data, combinations so far missing are NA
dataset_new <- merge(dataset, iddat, all.x=TRUE, all.y=TRUE, by=c("id", "time"))

# drop time-constant variables from data
dataset_new[c("tc1", "tc2", "tc3")] <- list(NULL)

# merge back time-constant variables from original data
temp <- dataset[c("tc1", "tc2", "tc3")]
dataset_new <- merge(dataset_new, temp, by=c("id"))

# sort
dataset_new <- dataset_new[order(dataset_new$id, dataset_new$time), ]
dataset_new <- unique(dataset_new) # some rows are duplicates after last merge, no idea why

rm(temp)
rm(iddat)

All the best and thanks again, Matt

You should include the original data to make your problem easy to replicate -- take a look at `dput` — AndrewMacDonald, Jun 26 '14 at 18:22
But basically, if you have a list of id and time that *should* be there, you can just merge that with this data using `merge`. You'll get NA for those ids and times that don't exist — AndrewMacDonald, Jun 26 '14 at 18:24
And if you need to make a dataset of all possible id/time/tc1 combinations to use with `merge`, you can use `expand.grid` with the appropriate info from your dataset: `iddat = expand.grid(id = unique(dat$id), time = unique(dat$time), tc1 = unique(dat$tc1))`. — aosmith, Jun 26 '14 at 18:41
There are functions `make.pbalanced` and `make.pconsecutive` which might be handy for such situations in the development version of package `plm` (r-forge.r-project.org/R/?group_id=406). — Helix123, Jun 28 '16 at 19:45
Possible duplicate of [R elegant way to balance unbalanced panel data](http://stackoverflow.com/questions/25671246/r-elegant-way-to-balance-unbalanced-panel-data) — Michael Ohlrogge, Aug 20 '16 at 21:08

Kara Woo · Answer 1 · 2014-06-26T20:57:28.783

There are probably more elegant ways, but here's one option. I'm assuming that you need all combinations of id and time but not tc1 (i.e. tc1 is tied to id).

# your data
df <- read.table(text = "    id  time    tc1     obs1
1   101 1       male    4
2   101 2       male    5
3   101 3       male    3
4   102 1       female  6
5   102 3       female  2
6   103 1       male    2", header = TRUE)

First cast your data to wide format to introduce NAs, then convert back to long.

library('reshape2')

df_wide <- dcast(
  df, 
  id + tc1 ~ time,
  value.var = "obs1", 
  fill = NA
)

df_long <- melt(
  df_wide, 
  id.vars = c("id","tc1"), 
  variable.name = "time",
  value.name = "obs1"
)

# sort by id and then time
df_long[order(df_long$id, df_long$time), ]
   id    tc1 time obs1
1 101   male    1    4
4 101   male    2    5
7 101   male    3    3
2 102 female    1    6
5 102 female    2   NA
8 102 female    3    2
3 103   male    1    2
6 103   male    2   NA
9 103   male    3   NA

You can keep `tc1` when reshaping the dataset by using `id + tc1 ~ time` in `dcast` and then `id.vars = c("id","tc1")` and `variable.name = "time"` in `melt`. — eipi10, Jun 26 '14 at 20:47
@eipi10 Aha, there we go. Thanks for the comment; I've edited my answer accordingly. — Kara Woo, Jun 26 '14 at 20:58

kng229 · Answer 2 · 2014-06-26T20:18:34.077

You could create an empty dataset and then merge in the records in which you have matches.

 # Create dataset.  For you actual data ,you would replace c(1:3) with 
 # c(1:max(yourdata$id)) and adjust the number of time periods to match your data.
 id <- rep(c(1:3), each = 3)
 time <- rep(c(1:3), 3)
 df <- data.frame(id,time)


 test <- df[c(1,3,5,7,9),]
 test$tc1 <- c("male", "male", "female", "male", "male")
 test$obs1 <-c(4,5,3,6,2)

 merge(df, test, by.x = c("id","time"), by.y = c("id","time"), all.x = TRUE)

The result:

 id time    tc1 obs1
 1  1    1   male    4
 2  1    2   <NA>   NA
 3  1    3   male    5
 4  2    1   <NA>   NA
 5  2    2 female    3
 6  2    3   <NA>   NA
 7  3    1   male    6
 8  3    2   <NA>   NA
 9  3    3   male    2

A generalized version of @kng229's answer that can handle factor ids and automatically set time ranges: `id_list <- sort(rep(unique(df$id), max(df$time)-min(df$time) +1)) time_list <- rep(min(df$time):max(df$time)) empty_panel <- data.frame(id_list,time_list)` — Bryan, Nov 23 '14 at 02:53

Create missing observations in panel data

EDIT

2 Answers2