I am using a 'shares model' to estimate values for missing observations. With the example data set my.data
I am filling missing observations for each of three years proportionally to how observations were distributed in 1970 (although I could do so using 2010 or both 1970 and 2010).
Below I present example data, the desired result and code to obtain desired estimates in two ways. Code for the first approach is very specific to this example. I wish to create a more-general function than that used in the second approach. Creating a more-general function seems to me to require calling a function on a list of lists. I am hoping someone might offer advice on how to apply a function to a list of lists.
Here is the example data set and the highly specific solution:
my.data <- read.table(text = '
county y1970 y1980 y1990 y2000 y2010
aa 50 NA 70 NA 500
cc 10 20 NA NA 100
ee 800 NA NA 400 8000
gg 1000 1900 NA NA 10000
ii 200 400 300 100 2000
kk 20 40 30 NA 200
', header = TRUE, na.string='NA', stringsAsFactors=FALSE)
my.total <- read.table(text = '
county y1970 y1980 y1990 y2000 y2010
total 2080 4000 3000 1000 20800
', header = TRUE, na.string='NA', stringsAsFactors=FALSE)
desired.result <- read.table(text = '
county y1970 y1980 y1990 y2000 y2010
aa 50 96.47059 70 23.148148 500
cc 10 20 14.36464 4.629630 100
ee 800 1543.529 1149.17127 400 8000
gg 1000 1900 1436.46409 462.962963 10000
ii 200 400 300 100 2000
kk 20 40 30 9.259259 200
', header = TRUE, na.string='NA', stringsAsFactors=FALSE)
x70 <- c(50, 800)
estimates.for.80 <- (x70 / sum(x70)) * (my.total$y1980 - sum(my.data$y1980, na.rm = TRUE))
x80 <- c(10, 800, 1000)
estimates.for.90 <- (x80 / sum(x80)) * (my.total$y1990 - sum(my.data$y1990, na.rm = TRUE))
x90 <- c(50, 10, 1000, 20)
estimates.for.00 <- (x90 / sum(x90)) * (my.total$y2000 - sum(my.data$y2000, na.rm = TRUE))
Here is the function. I think this can be generalized if I knew how to include d.counties
as an input list to the function. In other words, how can I include d.counties
in my.input
and still have the function work? My confusion I think stems from the length of d.counties
differing among years.
state <- 'my.state'
my.df <- read.table(text = '
county y1970 y1980 y1990 y2000 y2010
aa 50 NA 70 NA 500
cc 10 20 NA NA 100
ee 800 NA NA 400 8000
gg 1000 1900 NA NA 10000
ii 200 400 300 100 2000
kk 20 40 30 NA 200
total 2080 4000 3000 1000 20800
', header = TRUE, na.string='NA', stringsAsFactors=FALSE)
pre.divide.up <- tail(my.df[,2:ncol(my.df)], 1) - colSums(head(my.df[,2:ncol(my.df)], -1), na.rm = TRUE)
# For each column containing NA's define the years to use as shares
# If use.years = 'pre' then use the year in pre.year
# If use.years = 'post' then use the year in post.year
# If use.years = 'both' then use both the year in pre.year and the year in post.year
#
# Here I define pre.year = y1970 and post.year = 2010 for every year
# However, 'pre.year' and 'post.year' are variables. They can differ among rows below.
shares <- read.table(text = '
cyear pre.year post.year use.years
y1980 y1970 y2010 pre
y1990 y1970 y2010 pre
y2000 y1970 y2010 pre
', header = TRUE, na.strings = "NA")
d.counties.80 <- c( 'aa' ,
'ee' )
d.counties.90 <- c( 'cc' ,
'ee' ,
'gg' )
d.counties.00 <- c( 'aa' ,
'cc' ,
'gg' ,
'kk' )
d.counties <- list(d.counties.80, d.counties.90, d.counties.00)
my.input <- data.frame(shares)
my.function <- function(y) {
# extract years of interest from my.df and store in data.frame called year.data
if(y[[4]] != 'last') year.data = my.df[names(my.df) %in% c("county", y[[2]], y[[1]], y[[3]])]
if(y[[4]] == 'last') year.data = my.df[names(my.df) %in% c("county", y[[2]], y[[1]] )]
# subset counties in year.data to only include counties with NA's in current year
if(as.numeric(substr(y[1], 2, 5)) == 1980) year.data = year.data[year.data$county %in% d.counties.80,]
if(as.numeric(substr(y[1], 2, 5)) == 1990) year.data = year.data[year.data$county %in% d.counties.90,]
if(as.numeric(substr(y[1], 2, 5)) == 2000) year.data = year.data[year.data$county %in% d.counties.00,]
# reorder columns in year.data
if(y[[4]] != 'last') year.data = year.data[, c('county', y[[2]], y[[1]], y[[3]])]
if(y[[4]] == 'last') year.data = year.data[, c('county', y[[2]], y[[1]] )]
# values to be divided, or distributed, among counties with NA's in the current year
divide.up <- pre.divide.up[, y[[1]]]
# sum values from designated pre and/or post years and bind those totals to bottom of year.data
if(y[[4]] != 'last') colsums.year = data.frame('total', as.data.frame(t(as.numeric(colSums(year.data[,c(2:4)], na.rm=TRUE)))))
if(y[[4]] == 'last') colsums.year = data.frame('total', as.data.frame(t(as.numeric(colSums(year.data[,c(2:3)], na.rm=TRUE)))))
names(colsums.year) <- names(year.data)
year.data.b <- rbind(year.data, colsums.year)
# obtain percentages in designated pre and/or post years for counties with NA's in current year
year.data.c <- year.data.b
year.data.c[, -1] <- lapply( year.data.c[ , -1], function(x){ x/x[nrow(year.data.b)] } )
# estimate county values for current year by distributing total missing values in current year
# according to how values were distributed in those same counties in other years
if(y[[4]] == 'both') year.data.b[, y[[1]]] = rowMeans(data.frame(year.data.c[, y[[2]]], year.data.c[, y[[3]]])) * as.numeric(divide.up)
if(y[[4]] == 'pre') year.data.b[, y[[1]]] = year.data.c[, y[[2]]] * as.numeric(divide.up)
if(y[[4]] == 'post') year.data.b[, y[[1]]] = year.data.c[, y[[3]]] * as.numeric(divide.up)
if(y[[4]] == 'last') year.data.b[, y[[1]]] = year.data.c[, y[[2]]] * as.numeric(divide.up)
# extract estimates for current year along with the county column, then remove the last row
year.data.last <- year.data.b[names(year.data.b) %in% c("county", y[[1]])]
year.data.last <- year.data.last[-nrow(year.data.last),]
colnames(year.data.last) <- c('county', 'acreage')
# create a data set for export
this.year <- rep(as.numeric(substr(y[[1]], 2, 5)), nrow(year.data.last))
revised.data <- data.frame(state, this.year, year.data.last)
return(revised.data)
}
my.list <- apply(shares, 1, function(y) my.function(y))
my.list2 <- do.call("rbind", my.list)
my.list2
state this.year county acreage
1 my.state 1980 aa 96.470588
3 my.state 1980 ee 1543.529412
2 my.state 1990 cc 14.364641
31 my.state 1990 ee 1149.171271
4 my.state 1990 gg 1436.464088
11 my.state 2000 aa 23.148148
21 my.state 2000 cc 4.629630
41 my.state 2000 gg 462.962963
6 my.state 2000 kk 9.259259
Although this function is not as general as the one in my answer below, the function above does allow explicit designation of which counties have relevant missing values. In the actual data there are two types of missing values and the function in my answer below cannot tell the two types apart. The function above can tell them apart because I tell it exactly which counties to consider each year.
Thank you again for any advice and for advice already offered.