I have two data frames, x
and weights
, in which columns are paired. Here are example data frames:
x = read.table(text = "
yr1 yr2 yr3 yr4
10 15 6 8
10 20 30 NA
NA 5 2 3
100 100 NA NA",
sep = "", header = TRUE)
weights = read.table(text = "
yr1 yr2 yr3 yr4
2 4 1 3
2 2 4 2
3 2 2 3
4 2 2 4",
sep = "", header = TRUE)
The columns yr1
and yr2
are one pair and the columns yr3
and yr4
are another pair. With my actual data the columns go up to yr100
and there are 50 pairs of columns.
If yr1
or yr2
is missing in x
I want to fill the missing observation with, for example:
(5 / 2) * 3
Likewise for yr3
or yr4
:
(30 / 4) * 2
where 5 (or 30) is the element in the column in x
that is not missing for a given pair of elements. The values 2 and 3 for the first example (and the values 4 and 2 in the second example) are the corresponding elements in the weights
data frame for a given pair of elements in the x
data frame. If both elements in a pair are missing in x
I want to leave them as missing.
Here is R
code that does the above operations using nested for loops
. However, there are 2000 or 3000 rows in my actual data set and the nested for loops
have been running now for >10 hours.
for(i in 1: (ncol(x)/2)) {
for(j in 1: nrow(x)) {
if( is.na(x[j,(1 + (i-1)*2)]) & !is.na(x[j,(1 + (i-1)*2 + 1)])) x[j,(1 + (i-1)*2 + 0)] = (x[j,(1 + ((i-1)*2 + 1))] / weights[j,(1 + ((i-1)*2 + 1))]) * weights[j,(1 + (i-1)*2 + 0)]
if(!is.na(x[j,(1 + (i-1)*2)]) & is.na(x[j,(1 + (i-1)*2 + 1)])) x[j,(1 + (i-1)*2 + 1)] = (x[j,(1 + ((i-1)*2 + 0))] / weights[j,(1 + ((i-1)*2 + 0))]) * weights[j,(1 + (i-1)*2 + 1)]
if( is.na(x[j,(1 + (i-1)*2)]) & is.na(x[j,(1 + (i-1)*2 + 1)])) x[j,(1 + (i-1)*2 + 0)] = NA
if( is.na(x[j,(1 + (i-1)*2)]) & is.na(x[j,(1 + (i-1)*2 + 1)])) x[j,(1 + (i-1)*2 + 1)] = NA
}
}
I have realized that the third and fourth if
statements probably are not necessary. Perhaps the time to run this code will be reduced substantially if I simply remove those two if
statements.
However, I also came up with the following alternative solution that uses reshape
instead of nested for loops
:
n.years <- 4
x2 <- reshape(x , direction="long", varying = list(seq(1,(n.years-1),2), seq(2,n.years,2)), v.names = c("yr1", "yr2"), times = c("t1", "t2"))
wt2 <- reshape(weights, direction="long", varying = list(seq(1,(n.years-1),2), seq(2,n.years,2)), v.names = c("yr1", "yr2"), times = c("t1", "t2"))
x2$yr1 <- ifelse(is.na(x2$yr1), (x2$yr2 / wt2$yr2) * wt2$yr1, x2$yr1)
x2$yr2 <- ifelse(is.na(x2$yr2), (x2$yr1 / wt2$yr1) * wt2$yr2, x2$yr2)
x3 <- reshape(x2, direction="wide", varying = list(seq(1,3,2), seq(2,4,2)), v.names = c("yr1", "yr2"), times = c("t1", "t2"))
x3
Before I shut the current R session down and try one of the above approaches please suggest possible alternatives that might be more efficient. I have used microbenchmark
a little bit, but have not yet attempted to do so here, partially because writing a function for each possible solution is a little intimidating to me. I also tried coming up with a solution using the apply
family of functions, but could not come up with one.
My reshape
solution was derived from this question:
Reshaping a data frame with more than one measure variable
In addition to computation time I am also concerned about possible memory exhaustion.
I try hard to stick with base R, but will consider using other options to obtain desired output. Thank you for any suggestions.