So I'm not sure it satisfies the "elegant" requirement, but here's a general-purpose function you can use to get balanced data.
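balanced <- function(data, ID, TIME, VARS, required = c("all", "shared")) {
    # Convert column names to column positions
    if (is.character(ID)) {
        ID <- match(ID, names(data))
    }
    if (is.character(TIME)) {
        TIME <- match(TIME, names(data))
    }
    if (missing(VARS)) {
        # Default: check every column that is not an ID or TIME column
        VARS <- setdiff(seq_len(ncol(data)), c(ID, TIME))
    } else if (is.character(VARS)) {
        VARS <- match(VARS, names(data))
    }
    required <- match.arg(required)
    # One factor level per distinct ID (and per distinct TIME) combination
    idf <- do.call(interaction, c(data[, ID, drop = FALSE], drop = TRUE))
    timef <- do.call(interaction, c(data[, TIME, drop = FALSE], drop = TRUE))
    # Rows with no NA in any of the checked columns
    complete <- complete.cases(data[, VARS, drop = FALSE])
    # Count the complete observations in each ID/TIME cell
    tbl <- table(idf[complete], timef[complete])
    if (required == "all") {
        # Keep IDs with exactly one complete row at every TIME
        keep <- which(rowSums(tbl == 1) == ncol(tbl))
        idx <- as.numeric(idf) %in% keep
    } else if (required == "shared") {
        # Keep TIMEs at which every ID has exactly one complete row
        keep <- which(colSums(tbl == 1) == nrow(tbl))
        idx <- as.numeric(timef) %in% keep
    }
    data[idx, ]
}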
You can get your desired result with
balanced(unbal, "PERSON","YEAR")
# PERSON YEAR Y X
# 1 Frank 2001 21 1
# 2 Frank 2002 22 2
# 3 Frank 2003 23 3
# 4 Frank 2004 24 4
# 5 Frank 2005 25 5
# 11 Edward 2001 31 11
# 12 Edward 2002 32 12
# 13 Edward 2003 33 13
# 14 Edward 2004 34 14
# 15 Edward 2005 35 15
The first parameter is the data.frame you wish to subset. The second parameter (ID=) is a character vector of column names that identify each "person" in the data set. The TIME= parameter is also a character vector, naming the column(s) that give the observation time for each ID. You can optionally pass a VARS= argument to specify which fields must not be NA (it defaults to every column other than the ID and TIME columns). The last parameter, required, states whether each ID must have an observation for every TIME ("all", the default) or, if you set it to "shared", whether to return only the TIMEs for which all IDs have non-missing values.
So for example
balanced(unbal, "PERSON","YEAR", "X")
# PERSON YEAR Y X
# 1 Frank 2001 21 1
# 2 Frank 2002 22 2
# 3 Frank 2003 23 3
# 4 Frank 2004 24 4
# 5 Frank 2005 25 5
# 6 Tony 2001 5 6
# 7 Tony 2002 6 7
# 8 Tony 2003 NA 8
# 9 Tony 2004 7 9
# 10 Tony 2005 8 10
# 11 Edward 2001 31 11
# 12 Edward 2002 32 12
# 13 Edward 2003 33 13
# 14 Edward 2004 34 14
# 15 Edward 2005 35 15
only requires that "X" be non-NA for all PERSON/YEARs, and since this is true for all records, no subsetting takes place.
If you do
balanced(unbal, "PERSON","YEAR", required="shared")
# PERSON YEAR Y X
# 1 Frank 2001 21 1
# 2 Frank 2002 22 2
# 4 Frank 2004 24 4
# 5 Frank 2005 25 5
# 6 Tony 2001 5 6
# 7 Tony 2002 6 7
# 9 Tony 2004 7 9
# 10 Tony 2005 8 10
# 11 Edward 2001 31 11
# 12 Edward 2002 32 12
# 14 Edward 2004 34 14
# 15 Edward 2005 35 15
then you get the data for years 2001, 2002, 2004, 2005 for ALL persons since they all have data for those years.
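To see where both modes come from, it can help to look at the table of complete observations that the function builds internally. The following is a minimal sketch: the `unbal` data frame here is an assumed reconstruction inferred from the printed output above, and `full_ids` and `shared_years` are names made up for illustration.

```r
# Assumed reconstruction of `unbal`, inferred from the printed output
unbal <- data.frame(
  PERSON = rep(c("Frank", "Tony", "Edward"), each = 5),
  YEAR   = rep(2001:2005, times = 3),
  Y      = c(21:25, 5, 6, NA, 7, 8, 31:35),
  X      = 1:15
)

# Rows with no missing value in the checked columns (here Y and X)
complete <- complete.cases(unbal[, c("Y", "X")])

# Number of complete rows in each PERSON/YEAR cell
tbl <- table(unbal$PERSON[complete], unbal$YEAR[complete])

# required = "all": persons with a complete row in every year
full_ids <- names(which(rowSums(tbl == 1) == ncol(tbl)))

# required = "shared": years in which every person has a complete row
shared_years <- names(which(colSums(tbl == 1) == nrow(tbl)))

print(tbl)
print(full_ids)      # Tony is excluded because Y is NA in 2003
print(shared_years)  # 2003 is excluded for the same reason
```

The single zero in the table (Tony, 2003) is what removes Tony under "all" and removes 2003 under "shared".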
Now let us create a slightly different sample data set
unbal2 <- unbal
unbal2[15, 2] <- 2006
tail(unbal2)
# PERSON YEAR Y X
# 10 Tony 2005 8 10
# 11 Edward 2001 31 11
# 12 Edward 2002 32 12
# 13 Edward 2003 33 13
# 14 Edward 2004 34 14
# 15 Edward 2006 35 15
Notice now that Edward is the only person that has a value for 2006. This means that
balanced(unbal2, "PERSON","YEAR")
# [1] PERSON YEAR Y X
# <0 rows> (or 0-length row.names)
now returns nothing, but
balanced(unbal2, "PERSON","YEAR", required="shared")
# PERSON YEAR Y X
# 1 Frank 2001 21 1
# 2 Frank 2002 22 2
# 4 Frank 2004 24 4
# 6 Tony 2001 5 6
# 7 Tony 2002 6 7
# 9 Tony 2004 7 9
# 11 Edward 2001 31 11
# 12 Edward 2002 32 12
# 14 Edward 2004 34 14
will return the data for 2001, 2002, and 2004, since all persons have data for those years.
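If you want to confirm that a result really is balanced, one option is a small check like the one below. This is only a sketch: `is_balanced` is a hypothetical helper, not part of the function above, and it assumes a single ID column and a single TIME column.

```r
# Hypothetical helper: TRUE when every ID appears exactly once at every TIME
is_balanced <- function(data, id, time) {
  # as.character() avoids counting unused factor levels as empty cells
  counts <- table(as.character(data[[id]]), as.character(data[[time]]))
  all(counts == 1)
}

# Tiny made-up panel: two persons, two years each
panel <- data.frame(
  PERSON = c("A", "A", "B", "B"),
  YEAR   = c(2001, 2002, 2001, 2002)
)

balanced_ok  <- is_balanced(panel, "PERSON", "YEAR")        # every cell filled once
balanced_bad <- is_balanced(panel[-4, ], "PERSON", "YEAR")  # B has no 2002 row

print(balanced_ok)
print(balanced_bad)
```

Note that this only checks the ID/TIME layout; it does not look for NA values in the other columns.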