4

I am working on panel data with a unique case identifier and a column for the time points of the observations (long format). There are both time-constant variables and time-varying observations:

    id  time    tc1     obs1
1   101 1       male    4
2   101 2       male    5
3   101 3       male    3
4   102 1       female  6
5   102 3       female  2
6   103 1       male    2

For my model I now need data with complete records per id for each time point. In other words, if an observation is missing I still need to put in a row with id, time, time-constant variables, and NA for the observed variables (as would be the line (102, 2, "female", NA) in the above example). So my question is:

  1. How can I find out if a row with unique combination of id and time already exists in my dataset?
  2. If not, how can I add this row, carry over time-constant variables and fill the observations with NA?

Would be great if someone could shed some light on this.

Thanks a lot in advance!


EDIT

Thank you everyone for your replies. Here is what I finally did, which is mix of several suggested approaches. The thing is that I have several time-varying variables (obs1-obsn) per row and I did not get dcast to accomodate for that - value.name does not take more than argument.

# create all possible permutations of id and year
iddat = expand.grid(id = unique(dataset$id), time = (c(1996,1999,2002,2005,2008,2011)))
iddat <- iddat[order(iddat$id, iddat$time), ]

# add permutations to existing data, combinations so far missing are NA
dataset_new <- merge(dataset, iddat, all.x=TRUE, all.y=TRUE, by=c("id", "time"))

# drop time-constant variables from data
dataset_new[c("tc1", "tc2", "tc3")] <- list(NULL)

# merge back time-constant variables from original data
temp <- dataset[c("tc1", "tc2", "tc3")]
dataset_new <- merge(dataset_new, temp, by=c("id"))

# sort
dataset_new <- dataset_new[order(dataset_new$id, dataset_new$time), ]
dataset_new <- unique(dataset_new) # some rows are duplicates after last merge, no idea why

rm(temp)
rm(iddat)

All the best and thanks again, Matt

Jaap
  • 81,064
  • 34
  • 182
  • 193
matt
  • 41
  • 3
  • You should include the original data to make your problem easy to replicate -- take a look at `dput` – AndrewMacDonald Jun 26 '14 at 18:22
  • 1
    But basically, if you have a list of id and time that *should* be there, you can just merge that with this data using `merge`. You'll get NA for those ids and times that don't exist – AndrewMacDonald Jun 26 '14 at 18:24
  • 1
    And if you need to make a dataset of all possible id/time/tc1 combinations to use with `merge`, you can use `expand.grid` with the appropriate info from your dataset: `iddat = expand.grid(id = unique(dat$id), time = unique(dat$time), tc1 = unique(dat$tc1))`. – aosmith Jun 26 '14 at 18:41
  • 1
    There are functions `make.pbalanced` and `make.pconsecutive` which might be handy for such situations in the development version of package `plm` (r-forge.r-project.org/R/?group_id=406). – Helix123 Jun 28 '16 at 19:45
  • Possible duplicate of [R elegant way to balance unbalanced panel data](http://stackoverflow.com/questions/25671246/r-elegant-way-to-balance-unbalanced-panel-data) – Michael Ohlrogge Aug 20 '16 at 21:08

2 Answers2

2

There are probably more elegant ways, but here's one option. I'm assuming that you need all combinations of id and time but not tc1 (i.e. tc1 is tied to id).

# your data
df <- read.table(text = "    id  time    tc1     obs1
1   101 1       male    4
2   101 2       male    5
3   101 3       male    3
4   102 1       female  6
5   102 3       female  2
6   103 1       male    2", header = TRUE)

First cast your data to wide format to introduce NAs, then convert back to long.

library('reshape2')

df_wide <- dcast(
  df, 
  id + tc1 ~ time,
  value.var = "obs1", 
  fill = NA
)

df_long <- melt(
  df_wide, 
  id.vars = c("id","tc1"), 
  variable.name = "time",
  value.name = "obs1"
)

# sort by id and then time
df_long[order(df_long$id, df_long$time), ]
   id    tc1 time obs1
1 101   male    1    4
4 101   male    2    5
7 101   male    3    3
2 102 female    1    6
5 102 female    2   NA
8 102 female    3    2
3 103   male    1    2
6 103   male    2   NA
9 103   male    3   NA
Kara Woo
  • 3,595
  • 19
  • 31
  • 1
    You can keep `tc1` when reshaping the dataset by using `id + tc1 ~ time` in `dcast` and then `id.vars = c("id","tc1")` and `variable.name = "time"` in `melt`. – eipi10 Jun 26 '14 at 20:47
  • @eipi10 Aha, there we go. Thanks for the comment; I've edited my answer accordingly. – Kara Woo Jun 26 '14 at 20:58
2

You could create an empty dataset and then merge in the records in which you have matches.

 # Create dataset.  For you actual data ,you would replace c(1:3) with 
 # c(1:max(yourdata$id)) and adjust the number of time periods to match your data.
 id <- rep(c(1:3), each = 3)
 time <- rep(c(1:3), 3)
 df <- data.frame(id,time)


 test <- df[c(1,3,5,7,9),]
 test$tc1 <- c("male", "male", "female", "male", "male")
 test$obs1 <-c(4,5,3,6,2)

 merge(df, test, by.x = c("id","time"), by.y = c("id","time"), all.x = TRUE)

The result:

 id time    tc1 obs1
 1  1    1   male    4
 2  1    2   <NA>   NA
 3  1    3   male    5
 4  2    1   <NA>   NA
 5  2    2 female    3
 6  2    3   <NA>   NA
 7  3    1   male    6
 8  3    2   <NA>   NA
 9  3    3   male    2
kng229
  • 473
  • 5
  • 13
  • 1
    A generalized version of @kng229's answer that can handle factor ids and automatically set time ranges: `id_list <- sort(rep(unique(df$id), max(df$time)-min(df$time) +1)) time_list <- rep(min(df$time):max(df$time)) empty_panel <- data.frame(id_list,time_list)` – Bryan Nov 23 '14 at 02:53