2

I want to reduce time and memory usage (I previously used outer for this but it consumes more memory than I have) by reducing the iterations to create a symmetric matrix, that is sol[i, j] is the same as sol[j, i].

My code so far:

# Prepare input
subss <- list(a = c(1, 2, 4), b = c(1, 2, 3), c = c(4, 5))
A <- matrix(runif(25), ncol = 5, nrow = 5)
# Pre allocate memory
sol <- matrix(nrow = length(subss), ncol = length(subss), 
          dimnames = list(names(subss), names(subss))) 
x <- 0
for (i in seq_along(subss)) {
    # Omit for the subsets I already calculated ?
    for (j in seq_along(subss)) {
        x <- x + 1
        message(x)

        # The function I use here might result in a NA
        sol[i, j] <- mean(A[subss[[i]], subss[[j]]]) 
        sol[j, i] <- sol[i, j] # Will overwrite when it shouldn't
    }
}

Will use 9 iterations, how can I avoid them and do just 6 iterations?

I need to calculate the symmetric values, so this question doesn't apply. Also this other one doesn't work either because there might be many combinations and at some point it can't allocate the vector in memory.

Community
  • 1
  • 1
llrs
  • 3,308
  • 35
  • 68
  • No, first it won't use my subss, so the dimension of what you propose is 5 x5 not 3x3, second my real function instead of mean is more complex – llrs Apr 06 '17 at 11:02

3 Answers3

1

A for loop will usually be slower than outer. Try byte-compiling the loop or implement it in Rcpp.

subss <- list(a = c(1, 2, 4), b = c(1, 2, 3), c = c(4, 5))
set.seed(42)
A <- matrix(runif(25), ncol = 5, nrow = 5)

#all combinations of indices
ij <- combn(seq_along(subss), 2)

#add all i = j
ij <- matrix(c(ij, rep(seq_along(subss), each = 2)), nrow = 2)

#preallocate
res <- numeric(ncol(ij))

#only one loop
for (k in seq_len(ncol(ij))) {

    message(k)

    res[k] <- mean(A[subss[[ij[1, k]]], subss[[ij[2, k]]]]) 
}
#1
#2
#3
#4
#5
#6

#create symmetric sparse matrix    
library(Matrix)
sol <- sparseMatrix(i = ij[1,], j = ij[2,],
                    x = res, dims = rep(length(subss), 2), 
                    symmetric = TRUE, index1 = TRUE)
#3 x 3 sparse Matrix of class "dsCMatrix"
#                                  
#[1,] 0.7764715 0.6696987 0.7304413
#[2,] 0.6696987 0.6266553 0.6778936
#[3,] 0.7304413 0.6778936 0.5161089
Roland
  • 127,288
  • 10
  • 191
  • 288
  • I will test how it works. But subss is 20000 long so all combinations are quite big. Also I try to avoid dependencies and the matrix is not sparse, what advantage does it bring? That it is stored more efficiently? – llrs Apr 06 '17 at 12:41
  • A symmetric matrix is sparse. After all, you only need to store one triangle and the diagonal. On my system it takes about a minute to calculate the 400 million combinations of `i` and `j`. Your performance problem will most likely rather be the 400 million calls to your function. You should seriously consider if your really need to do this and if you do, you should use Rcpp for the task. – Roland Apr 06 '17 at 13:26
  • Time is not much a constraint (with `outer` it gets things done in ~5 hours depending on the data), but memory it is (with `outer` the process reaches 91Gb as measured with htop), that's why I thought of using a loop to prevent having all the subsets from outer in memory. But I might end up moving the function to Rcpp as you said. – llrs Apr 06 '17 at 13:52
  • Sorry to bother you again, but the for loop takes a while in the computer, just using the mean with the 200 million of combinations of `i` and `j`. So I assume it is not only my function call, (which in `outer` is called just once). Anyway, thanks for your help. – llrs Apr 14 '17 at 09:28
  • Any R function call (and mean isn't very fast because it is an S3 generic) takes time. Multiply that time with a large number of iterations and you are in for a wait. That's why you should vectorize and one reason why Rcpp was created. – Roland Apr 14 '17 at 09:32
  • I posted the function in [Code Review],(https://codereview.stackexchange.com/q/160022/36067), in case you want to have a look at it. Thanks for your help! – llrs Apr 14 '17 at 09:43
0

I found a way with plain for loops:

x <- 0
for (i in seq_along(subss)) {
    for (j in seq_len(i)) { # or for (j in 1:i) as proposed below
        x <- x + 1
        message(x)

        sol[i, j] <- mean(A[subss[[i]], subss[[j]]]) 
        sol[j, i] <- sol[i, j]
    }
}
llrs
  • 3,308
  • 35
  • 68
0
for (i in 1:length(subss)) {
  for (j in 1:i) {
    message(i, ' ', j, ' - ', mean(A[subss[[i]], subss[[j]]]) ) # Check iterations and value
    sol2[i, j] <- sol2[j, i] <- mean(A[subss[[i]], subss[[j]]]) 
  }
}

I checked your script values and aren't symmetric:

1 1 - 0.635455905252861
1 2 - 0.638608284398086
1 3 - 0.488700995299344
2 1 - 0.568414432255344
2 2 - 0.602851431118324
2 3 - 0.516099992596234
3 1 - 0.595461705311512
3 2 - 0.656920690399905
3 3 - 0.460815121419728

Mine values (same as @Llopis):

1 2 - 0.638608284398086
1 3 - 0.488700995299344
2 2 - 0.602851431118324
2 3 - 0.516099992596234
3 2 - 0.656920690399905
3 3 - 0.460815121419728
gonzalez.ivan90
  • 1,322
  • 1
  • 12
  • 22