
I am trying to apply a function to a very large matrix. Eventually I want to create a 40,000 by 40,000 matrix (where only one side of the diagonal is filled in) or a list of the results.

The matrix looks like:

            obs 1     obs 2     obs 3     obs 4     obs 5     obs 6     obs 7     obs 8     obs 9
words 1 0.2875775 0.5999890 0.2875775 0.5999890 0.2875775 0.5999890 0.2875775 0.5999890 0.2875775
words 2 0.7883051 0.3328235 0.7883051 0.3328235 0.7883051 0.3328235 0.7883051 0.3328235 0.7883051
words 3 0.4089769 0.4886130 0.4089769 0.4886130 0.4089769 0.4886130 0.4089769 0.4886130 0.4089769
words 4 0.8830174 0.9544738 0.8830174 0.9544738 0.8830174 0.9544738 0.8830174 0.9544738 0.8830174
words 5 0.9404673 0.4829024 0.9404673 0.4829024 0.9404673 0.4829024 0.9404673 0.4829024 0.9404673
words 6 0.0455565 0.8903502 0.0455565 0.8903502 0.0455565 0.8903502 0.0455565 0.8903502 0.0455565

I apply the function with cosine(mat[, 3], mat[, 4]), which gives me a single number:

          [,1]
[1,] 0.7546113

I can do this for all of the columns, but I want to be able to know which columns each result came from, i.e. the calculation above came from columns 3 and 4, which are "obs 3" and "obs 4".

Expected output might be the results in a list or a matrix like:

          [,1]   [,2]   [,3]
[1,]        1      .      .
[2,]      0.75     1      .
[3,]      0.23    0.87    1

(Where the numbers here are made up)

So the dimensions will be ncol(mat) by ncol(mat) (if I go with the matrix approach).

Data/Code:

#generate some data

mat <- matrix(data = runif(200), nrow = 100, ncol = 20, dimnames = list(paste("words", 1:100),
                                                                        paste("obs", 1:20)))


mat


#calculate the following function
library(lsa)
cosine(mat[, 3], mat[, 4])
cosine(mat[, 4], mat[, 5])
cosine(mat[, 5], mat[, 6])

Additional

I thought about creating an empty matrix and filling it in a for loop, but it's not working as expected, and creating a 40,000 by 40,000 matrix of 0's brings up memory issues.

co <- matrix(0L, nrow = ncol(mat), ncol = ncol(mat), dimnames = list(colnames(mat), colnames(mat)))
co

for (i in 2:ncol(mat)) {
  for (j in 1:(i - 1)) {
    co[i, j] = cosine(mat[, i], mat[, j])
  }
}

co

I also tried putting the results into a list:

List <- list()
for(i in 1:ncol(mat))
{
  temp <- List[[i]] <- mat
}

res <- List[1][[1]]
res

Which is also wrong.

So I am trying to create a function which will apply cosine to each pair of columns and store the results together with the column names.

user113156

4 Answers


One option is to define a function that takes two column indices and then use outer to apply it to all combinations of columns.

fun <- function(x, y) {
   cosine(mat[, x], mat[, y])
}

outer(seq_len(ncol(mat)), seq_len(ncol(mat)), Vectorize(fun))

#       [,1]   [,2]   [,3]   [,4]   [,5]  ..... 
#[1,] 1.0000 0.7824 1.0000 0.7824 1.0000 .....
#[2,] 0.7824 1.0000 0.7824 1.0000 0.7824 .....
#[3,] 1.0000 0.7824 1.0000 0.7824 1.0000 .....
#[4,] 0.7824 1.0000 0.7824 1.0000 0.7824 .....
#[5,] 1.0000 0.7824 1.0000 0.7824 1.0000 .....
#....
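
The result of outer has no dimnames, so on its own it does not show which columns each value came from. A minimal sketch of attaching the column names follows. It assumes a base-R helper `cos_sim`, a hypothetical stand-in for `lsa::cosine` so the snippet runs without extra packages, and it uses `runif(2000)` so that the toy columns are not recycled copies of each other.

```r
# Toy data: 100 words x 20 observations (2000 values, so no recycling)
set.seed(1)
mat <- matrix(runif(2000), nrow = 100, ncol = 20,
              dimnames = list(paste("words", 1:100), paste("obs", 1:20)))

# Hypothetical stand-in for lsa::cosine(): cosine similarity of two vectors
cos_sim <- function(x, y) sum(x * y) / sqrt(sum(x^2) * sum(y^2))

idx <- seq_len(ncol(mat))
res <- outer(idx, idx, Vectorize(function(i, j) cos_sim(mat[, i], mat[, j])))

# Label rows and columns so every entry can be traced back to its columns
dimnames(res) <- list(colnames(mat), colnames(mat))

# Look a pair up by name; same value as cos_sim(mat[, 3], mat[, 4])
res["obs 3", "obs 4"]
```

This still builds the full ncol(mat) by ncol(mat) matrix, so it does not help with the memory error on the full data.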
Ronak Shah
  • Thanks! This works, and it works well on some other data and gives the correct results, but on the full data I get the same memory issues `Error: cannot allocate vector of size 7.1 Gb`. I might have to split the data up into chunks and calculate these... – user113156 May 25 '19 at 15:46
  • @user113156 yes, it gives you that error if the data is too big to fit into the memory available. There are many posts available if you google that error message explaining how to efficiently manage such huge data. One of them is here https://stackoverflow.com/questions/5171593/r-memory-management-cannot-allocate-vector-of-size-n-mb – Ronak Shah May 25 '19 at 15:49

1) Using mat shown in the question, the first line creates a 20x20 matrix with all 20*20 cosines filled in. The second line zeroes out the values on and above the diagonal. Use lower.tri instead if you prefer that the values on and below the diagonal be zeroed out.

comat <- cosine(mat)
comat[upper.tri(comat, diag = TRUE)] <- 0

2) Alternately to create a named numeric vector of the results:

covec <- c(combn(as.data.frame(mat), 2, function(x) c(cosine(x[, 1], x[, 2]))))
names(covec) <- combn(colnames(mat), 2, paste, collapse = "-")

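3) For mean-centered columns the cosines are the same as the correlations, so fast correlation functions can be used. Note that for uncentered data such as mat the two are not related by a single constant factor, so the rescaling by mult below is at best a rough approximation and should be checked against cosine(mat) on a subset of columns.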

mult <- c(cosine(mat[, 1], mat[, 2]) / cor(mat[, 1], mat[, 2]))
co3 <- mult * cor(mat)
co3[upper.tri(co3, diag = TRUE)] <- 0

3a) This opens up using any of several correlation functions available in R. For example, using mult just calculated:

library(HiClimR)
co4 <- mult * fastCor(mat)
co4[upper.tri(co4, diag = TRUE)] <- 0

3b) Or with bigcor from the propagate package:

library(propagate)
co5 <- mult * bigcor(mat)
co5[upper.tri(co5, diag = TRUE)] <- 0

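3c) Or computing the correlation matrix directly with crossprod (this gives the correlations of the centered and scaled columns, without the mult factor):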

co6 <- crossprod(scale(mat)) / (nrow(mat) - 1)
co6[upper.tri(co6, diag = TRUE)] <- 0
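
Cosine similarity itself can also be written as a cross-product, without centering: scale each column to unit Euclidean length and take crossprod. The following is a minimal base-R sketch of that idea, not part of the lsa package. The names `unit`, `comat2`, and `pairs` are introduced here for illustration, and the toy data uses `runif(2000)` so that the columns are distinct.

```r
# Toy data: 100 words x 20 observations
set.seed(1)
mat <- matrix(runif(2000), nrow = 100, ncol = 20,
              dimnames = list(paste("words", 1:100), paste("obs", 1:20)))

# Scale every column to unit Euclidean length
unit <- sweep(mat, 2, sqrt(colSums(mat^2)), "/")

# t(unit) %*% unit: entry [i, j] is the cosine between columns i and j.
# crossprod() keeps the column names as dimnames of the result.
comat2 <- crossprod(unit)

# Keep one triangle in long format, with the column names of each pair
keep <- which(lower.tri(comat2), arr.ind = TRUE)
pairs <- data.frame(col1 = colnames(mat)[keep[, "col"]],
                    col2 = colnames(mat)[keep[, "row"]],
                    cosine = comat2[keep])
head(pairs)
```

The long format stores each pair once, together with the names of the two columns it came from, which is roughly half the size of the full square matrix.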
G. Grothendieck
  • Yay for a vectorized language! – Aaron left Stack Overflow May 25 '19 at 16:28
  • Thanks! Your first method `cosine(mat)` gave me the `cannot allocate vector... memory` error. Your second method seems to work. I am currently running @akrun's method; I will run your second method as soon as it finishes and let you know. – user113156 May 25 '19 at 16:41

We can do this with a nested sapply

i1 <- seq_len(ncol(mat))
sapply(i1, function(i) sapply(i1, function(j) cosine(mat[, i], mat[, j])))
#           [,1]      [,2]      [,3]      [,4]      [,5]      [,6]      [,7]      [,8]      [,9]     [,10]     [,11]     [,12]
# [1,] 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016
# [2,] 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000
# [3,] 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016
# [4,] 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000
# [5,] 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016
# [6,] 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000
# [7,] 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016
# ....
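
For the full 40,000-column case, the same pairwise computation can be done block by block, so that only one block of results has to be held in memory at a time. The sketch below is one possible way to do that in base R and is not taken from the question. `block_size` and all variable names are assumptions chosen for illustration, and the cosine is computed with crossprod on unit-length columns instead of `lsa::cosine`.

```r
# Toy data: 100 words x 20 observations
set.seed(1)
mat <- matrix(runif(2000), nrow = 100, ncol = 20,
              dimnames = list(paste("words", 1:100), paste("obs", 1:20)))

# Scale every column to unit length so that crossprod() yields cosines
unit <- sweep(mat, 2, sqrt(colSums(mat^2)), "/")

block_size <- 8  # assumption: tune so a block_size x block_size block fits in RAM
starts <- seq(1, ncol(unit), by = block_size)
blocks <- lapply(starts, function(s) s:min(s + block_size - 1, ncol(unit)))

results <- list()
for (a in seq_along(blocks)) {
  for (b in seq_len(a)) {  # only blocks on or below the diagonal
    ia <- blocks[[a]]
    ib <- blocks[[b]]
    # Rows correspond to the columns in block a, columns to those in block b
    blk <- crossprod(unit[, ia, drop = FALSE], unit[, ib, drop = FALSE])
    # Keep only pairs strictly below the diagonal of the full matrix
    hit <- which(outer(ia, ib, ">"), arr.ind = TRUE)
    # On real data, write each block to disk (or filter it) here
    # instead of accumulating every pair in memory
    results[[length(results) + 1]] <- data.frame(
      col1 = colnames(unit)[ib[hit[, "col"]]],
      col2 = colnames(unit)[ia[hit[, "row"]]],
      cosine = blk[hit]
    )
  }
}
pairs_blockwise <- do.call(rbind, results)
```

On the toy data this collects every pair in one data frame. With 40,000 columns there are roughly 800 million pairs, so each block would need to be saved or reduced before moving on to the next.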
akrun
  • Thanks! I have set `mat <- Matrix(mat, sparse = TRUE)` to make things faster and it is running. I have no idea if it will cause memory issues like my previous attempts, but it's processing and I will let you know the results. Why would this method "seem" to work whereas my other attempts failed? What's going on inside R? – user113156 May 25 '19 at 16:29

We can use iteration over indexes using purrr (as a better(?) alternative to for loops). I think the toy dataset was supposed to have 2000, not 200 datapoints?

library(tidyverse)

mat <-
  matrix(
    data = runif(2000),
    nrow = 100,
    ncol = 20,
    dimnames = list(paste("words", 1:100),
                    paste("obs", 1:20))
  )

cos_summary <- tibble(Row1 = 3:5, Row2 = 4:6)

cos_summary <- cos_summary %>%
  mutate(cos_1_2 = map2_dbl(Row1, Row2, ~lsa::cosine(mat[,.x], mat[,.y])))

cos_summary

# A tibble: 3 x 3
   Row1  Row2 cos_1_2
  <int> <int>   <dbl>
1     3     4   0.710
2     4     5   0.734
3     5     6   0.751
Marian Minar