operating on pairs of elements in a data frame

Question

I have two data frames, x and weights, in which columns are paired. Here are example data frames:

x = read.table(text = "
  yr1  yr2  yr3  yr4
   10   15    6    8
   10   20   30   NA
   NA    5    2    3
  100  100   NA   NA", 
sep = "", header = TRUE)

weights = read.table(text = "
  yr1  yr2  yr3  yr4
    2    4    1    3
    2    2    4    2
    3    2    2    3
    4    2    2    4", 
sep = "", header = TRUE)

The columns yr1 and yr2 are one pair and the columns yr3 and yr4 are another pair. With my actual data the columns go up to yr100 and there are 50 pairs of columns.

If yr1 or yr2 is missing in x I want to fill the missing observation with, for example:

(5 / 2) * 3

Likewise for yr3 or yr4:

(30 / 4) * 2

where 5 (or 30) is the element in the column in x that is not missing for a given pair of elements. The values 2 and 3 for the first example (and the values 4 and 2 in the second example) are the corresponding elements in the weights data frame for a given pair of elements in the x data frame. If both elements in a pair are missing in x I want to leave them as missing.

Here is R code that does the above operations using nested for loops. However, there are 2000 or 3000 rows in my actual data set and the nested for loops have been running now for >10 hours.

for(i in 1: (ncol(x)/2)) {
  for(j in 1: nrow(x)) {

    if( is.na(x[j,(1 + (i-1)*2)]) & !is.na(x[j,(1 + (i-1)*2 + 1)])) x[j,(1 + (i-1)*2 + 0)] =  (x[j,(1 + ((i-1)*2 + 1))] / weights[j,(1 + ((i-1)*2 + 1))]) * weights[j,(1 + (i-1)*2 + 0)]
    if(!is.na(x[j,(1 + (i-1)*2)]) &  is.na(x[j,(1 + (i-1)*2 + 1)])) x[j,(1 + (i-1)*2 + 1)] =  (x[j,(1 + ((i-1)*2 + 0))] / weights[j,(1 + ((i-1)*2 + 0))]) * weights[j,(1 + (i-1)*2 + 1)] 
    if( is.na(x[j,(1 + (i-1)*2)]) &  is.na(x[j,(1 + (i-1)*2 + 1)])) x[j,(1 + (i-1)*2 + 0)] =  NA 
    if( is.na(x[j,(1 + (i-1)*2)]) &  is.na(x[j,(1 + (i-1)*2 + 1)])) x[j,(1 + (i-1)*2 + 1)] =  NA

 }
}

I have realized that the third and fourth if statements probably are not necessary. Perhaps the time to run this code will be reduced substantially if I simply remove those two if statements.

However, I also came up with the following alternative solution that uses reshape instead of nested for loops:

n.years <- 4

x2  <- reshape(x      , direction="long", varying = list(seq(1,(n.years-1),2), seq(2,n.years,2)), v.names = c("yr1", "yr2"), times = c("t1", "t2"))
wt2 <- reshape(weights, direction="long", varying = list(seq(1,(n.years-1),2), seq(2,n.years,2)), v.names = c("yr1", "yr2"), times = c("t1", "t2"))

x2$yr1  <- ifelse(is.na(x2$yr1), (x2$yr2 / wt2$yr2) * wt2$yr1, x2$yr1)
x2$yr2  <- ifelse(is.na(x2$yr2), (x2$yr1 / wt2$yr1) * wt2$yr2, x2$yr2)

x3  <- reshape(x2, direction="wide", varying = list(seq(1,3,2), seq(2,4,2)), v.names = c("yr1", "yr2"), times = c("t1", "t2"))
x3

Before I shut the current R session down and try one of the above approaches please suggest possible alternatives that might be more efficient. I have used microbenchmark a little bit, but have not yet attempted to do so here, partially because writing a function for each possible solution is a little intimidating to me. I also tried coming up with a solution using the apply family of functions, but could not come up with one.

My reshape solution was derived from this question:

Reshaping a data frame with more than one measure variable

In addition to computation time I am also concerned about possible memory exhaustion.

I try hard to stick with base R, but will consider using other options to obtain desired output. Thank you for any suggestions.

What are you ultimately trying to do with the data? I suspect in this case that it might be easier to directly obtain a desired output rather than modify the original dataframe. — Thomas, May 24 '13 at 09:38
Once I fill in missing observations I want to perform a weighted average on each pair of elements using the data in the weights data frame. — Mark Miller, May 24 '13 at 09:42

score 1 · Accepted Answer · answered May 24 '13 at 10:12

Does this work for you?

Note that I didn't use your replacement function because I found it a bit confusing, so you will have to fix how you replace the yr1 and yr2 variables with your formula. Also, you'll probably want to reshape the result if you need to be able to attach it to your original dataframe.

newx <- 
reshape(x, direction="long",varying=list(1:50*2-1,1:50*2), v.names=c("v1","v2"))

newwt <- 
reshape(weights, direction="long",varying=list(1:50*2-1,1:50*2), v.names=c("w1","w2"))

condwtmean <- function(x,y,wtx,wty){
    if(xor(is.na(x),is.na(y))){
        if(is.na(x))
            x <- y # replacement function
        if(is.na(y))
            y <- x # replacement function
        return(weighted.mean(c(x,y),c(wtx,wty)))
    }
    else if(!is.na(x) & !is.na(y))
        return(weighted.mean(c(x,y),c(wtx,wty)))
    else
        return(NA)  
}
newx$wtmean <- mapply(condwtmean, newx$v1, newx$v2, newwt$w1, newwt$w2)

Mark Miller · Answer 2 · 2013-05-24T21:07:33.693

Thomas's answer is much better than any of the three approaches I tried. Here I compare the four approaches with microbenchmark. I have not yet tried Thomas's answer with the actual data. My original nested for-loops approach is still running after 22 hours.

Unit: milliseconds
             expr       min        lq   median       uq      max neval
 fn.1(x, weights)  98.69133  99.47574 100.5313 101.7315 108.8757    20
 fn.2(x, weights) 755.51583 758.12175 762.3775 776.0558 801.9615    20
 fn.3(x, weights) 564.21423 567.98822 568.5322 571.0975 575.1809    20
 fn.4(x, weights) 367.05862 370.52657 371.7439 373.7367 395.0423    20

#########################################################################################

# create data

set.seed(1234)

n.rows <- 40
n.cols <- 40
n.sample <- n.rows * n.cols

x <- sample(20, n.sample, replace=TRUE)
x.NA <- sample(n.rows*n.cols, 10*(n.sample / n.rows), replace=FALSE)
x[x.NA] <- NA
x <- as.data.frame(matrix(x, nrow = n.rows))

weights <- sample(4, n.sample, replace=TRUE)
weights <- as.data.frame(matrix(weights, nrow = n.rows))
weights

#########################################################################################

# Thomas's function

fn.1 <- function(x, weights){

newx <- reshape(x, direction="long", varying = list(seq(1,(n.cols-1),2), seq(2,n.cols,2)), v.names=c("v1", "v2"))

newwt <- reshape(weights, direction="long", varying = list(seq(1,(n.cols-1),2), seq(2,n.cols,2)), v.names=c("w1", "w2"))

condwtmean <- function(x,y,wtx,wty){
    if(xor(is.na(x),is.na(y))){
        if(is.na(x))
            x <- (y / wty) * wtx # replacement function
        if(is.na(y))
            y <- (x / wtx) * wty # replacement function
        return(weighted.mean(c(x,y),c(wtx,wty)))
    }
    else if(!is.na(x) & !is.na(y))
        return(weighted.mean(c(x,y),c(wtx,wty)))
    else
        return(NA)  
}

newx$wtmean <- mapply(condwtmean, newx$v1, newx$v2, newwt$w1, newwt$w2)

newx2 <- reshape(newx[,c(1,4:5)], v.names = "wtmean", timevar = "time", direction = "wide")

newx2 <- newx2[,2:(n.cols/2+1)]
names(newx2) <- paste('X', 1:(n.cols/2), sep = "")

return(newx2)

}

fn.1.output <- fn.1(x, weights)

#########################################################################################

# nested for-loops with 4 if statements

fn.2 <- function(x, weights){

for(i in 1: (ncol(x)/2)) {
  for(j in 1: nrow(x)) {

    if( is.na(x[j,(1 + (i-1)*2)]) & !is.na(x[j,(1 + (i-1)*2 + 1)])) x[j,(1 + (i-1)*2 + 0)] =  (x[j,(1 + ((i-1)*2 + 1))] / weights[j,(1 + ((i-1)*2 + 1))]) * weights[j,(1 + (i-1)*2 + 0)]
    if(!is.na(x[j,(1 + (i-1)*2)]) &  is.na(x[j,(1 + (i-1)*2 + 1)])) x[j,(1 + (i-1)*2 + 1)] =  (x[j,(1 + ((i-1)*2 + 0))] / weights[j,(1 + ((i-1)*2 + 0))]) * weights[j,(1 + (i-1)*2 + 1)] 
    if( is.na(x[j,(1 + (i-1)*2)]) &  is.na(x[j,(1 + (i-1)*2 + 1)])) x[j,(1 + (i-1)*2 + 0)] =  NA 
    if( is.na(x[j,(1 + (i-1)*2)]) &  is.na(x[j,(1 + (i-1)*2 + 1)])) x[j,(1 + (i-1)*2 + 1)] =  NA

 }
}

x.weights = x * weights

numerator <- sapply(seq(1,ncol(x.weights),2), function(i) {
  apply(x.weights[,c(i, i+1)], 1, sum, na.rm=T)
})

denominator <- sapply(seq(1,ncol(weights),2), function(i) {
  apply(weights[,c(i, i+1)], 1, sum, na.rm=T)
})

weighted.x <- numerator/denominator

for(i in 1: (ncol(x)/2)) {
  for(j in 1:   nrow(x)      ) {

    if( is.na(x[j,(1 + (i-1)*2)]) & !is.na(x[j,(1 + (i-1)*2 + 1)])) weighted.x[j,i] =  sum(c(x[j,(1 + ((i-1)*2))], x[j,(1 + ((i-1)*2 + 1))]), na.rm = TRUE) 
    if(!is.na(x[j,(1 + (i-1)*2)]) &  is.na(x[j,(1 + (i-1)*2 + 1)])) weighted.x[j,i] =  sum(c(x[j,(1 + ((i-1)*2))], x[j,(1 + ((i-1)*2 + 1))]), na.rm = TRUE) 
    if( is.na(x[j,(1 + (i-1)*2)]) &  is.na(x[j,(1 + (i-1)*2 + 1)])) weighted.x[j,i] =  NA 

 }
}

return(weighted.x)

}

fn.2.output <- fn.2(x, weights)

fn.2.output <- as.data.frame(fn.2.output)
names(fn.2.output) <- paste('X', 1:(n.cols/2), sep = "")

#########################################################################################

# nested for-loops with 2 if statements

fn.3 <- function(x, weights){

for(i in 1: (ncol(x)/2)) {
  for(j in 1: nrow(x)) {

    if( is.na(x[j,(1 + (i-1)*2)]) & !is.na(x[j,(1 + (i-1)*2 + 1)])) x[j,(1 + (i-1)*2 + 0)] =  (x[j,(1 + ((i-1)*2 + 1))] / weights[j,(1 + ((i-1)*2 + 1))]) * weights[j,(1 + (i-1)*2 + 0)]
    if(!is.na(x[j,(1 + (i-1)*2)]) &  is.na(x[j,(1 + (i-1)*2 + 1)])) x[j,(1 + (i-1)*2 + 1)] =  (x[j,(1 + ((i-1)*2 + 0))] / weights[j,(1 + ((i-1)*2 + 0))]) * weights[j,(1 + (i-1)*2 + 1)] 

 }
}

x.weights = x * weights

numerator <- sapply(seq(1,ncol(x.weights),2), function(i) {
  apply(x.weights[,c(i, i+1)], 1, sum, na.rm=T)
})

denominator <- sapply(seq(1,ncol(weights),2), function(i) {
  apply(weights[,c(i, i+1)], 1, sum, na.rm=T)
})

weighted.x <- numerator/denominator

for(i in 1: (ncol(x)/2)) {
  for(j in 1:   nrow(x)      ) {

    if( is.na(x[j,(1 + (i-1)*2)]) & !is.na(x[j,(1 + (i-1)*2 + 1)])) weighted.x[j,i] =  sum(c(x[j,(1 + ((i-1)*2))], x[j,(1 + ((i-1)*2 + 1))]), na.rm = TRUE) 
    if(!is.na(x[j,(1 + (i-1)*2)]) &  is.na(x[j,(1 + (i-1)*2 + 1)])) weighted.x[j,i] =  sum(c(x[j,(1 + ((i-1)*2))], x[j,(1 + ((i-1)*2 + 1))]), na.rm = TRUE) 
    if( is.na(x[j,(1 + (i-1)*2)]) &  is.na(x[j,(1 + (i-1)*2 + 1)])) weighted.x[j,i] =  NA 

 }
}

return(weighted.x)

}

fn.3.output <- fn.3(x, weights)

fn.3.output <- as.data.frame(fn.3.output)
names(fn.3.output) <- paste('X', 1:(n.cols/2), sep = "")

#########################################################################################

# my reshape solution

fn.4 <- function(x, weights){

new.x    <- reshape(x      , direction="long", varying = list(seq(1,(n.cols-1),2), seq(2,n.cols,2)), v.names = c("v1", "v2"))
wt       <- reshape(weights, direction="long", varying = list(seq(1,(n.cols-1),2), seq(2,n.cols,2)), v.names = c("w1", "w2"))

new.x$v1 <- ifelse(is.na(new.x$v1), (new.x$v2 / wt$w2) * wt$w1, new.x$v1)
new.x$v2 <- ifelse(is.na(new.x$v2), (new.x$v1 / wt$w1) * wt$w2, new.x$v2)

x2  <- reshape(new.x, direction="wide", varying = list(seq(1,3,2), seq(2,4,2)), v.names = c("v1", "v2")) 

x <- x2[,2:(n.cols+1)]

x.weights = x * weights

numerator <- sapply(seq(1,ncol(x.weights),2), function(i) {
  apply(x.weights[,c(i, i+1)], 1, sum, na.rm=T)
})

denominator <- sapply(seq(1,ncol(weights),2), function(i) {
  apply(weights[,c(i, i+1)], 1, sum, na.rm=T)
})

weighted.x <- numerator/denominator

for(i in 1: (ncol(x)/2)) {
  for(j in 1:   nrow(x)      ) {

    if( is.na(x[j,(1 + (i-1)*2)]) & !is.na(x[j,(1 + (i-1)*2 + 1)])) weighted.x[j,i] =  sum(c(x[j,(1 + ((i-1)*2))], x[j,(1 + ((i-1)*2 + 1))]), na.rm = TRUE) 
    if(!is.na(x[j,(1 + (i-1)*2)]) &  is.na(x[j,(1 + (i-1)*2 + 1)])) weighted.x[j,i] =  sum(c(x[j,(1 + ((i-1)*2))], x[j,(1 + ((i-1)*2 + 1))]), na.rm = TRUE) 
    if( is.na(x[j,(1 + (i-1)*2)]) &  is.na(x[j,(1 + (i-1)*2 + 1)])) weighted.x[j,i] =  NA 

 }
}

return(weighted.x)

}

fn.4.output <- fn.4(x, weights)

fn.4.output <- as.data.frame(fn.4.output)
names(fn.4.output) <- paste('X', 1:(n.cols/2), sep = "")

#########################################################################################

rownames(fn.1.output) <- NULL
rownames(fn.2.output) <- NULL
rownames(fn.3.output) <- NULL
rownames(fn.4.output) <- NULL

all.equal(fn.1.output, fn.2.output)
all.equal(fn.1.output, fn.3.output)
all.equal(fn.1.output, fn.4.output)
all.equal(fn.2.output, fn.3.output)
all.equal(fn.2.output, fn.4.output)
all.equal(fn.3.output, fn.4.output)

library(microbenchmark)

microbenchmark(fn.1(x, weights), fn.2(x, weights), fn.3(x, weights), fn.4(x, weights), times=20)

#########################################################################################

operating on pairs of elements in a data frame

2 Answers2