
I have a fairly large dataset (~50K entries) which I use to generate a correlation matrix. This works well, using "only" ~20GB RAM.

Then I want to extract only the unique pairwise combinations from it and convert them into a data frame. This is where I run into issues: either RAM usage explodes or the indexing variable(s) overflow. I know the full matrix has over 2 billion cells (about 1.25 billion unique off-diagonal pairs), so I am aware it grows quite a bit in size, but still...
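
Just to put the sizes in perspective (rough back-of-the-envelope numbers):

n <- 50000
n^2                       # 2,500,000,000 cells in the full matrix (~20 GB as doubles)
choose(n, 2)              # 1,249,975,000 unique off-diagonal pairs
choose(n, 2) * 8 / 2^30   # ~9.3 GiB just for the correlation values as doubles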

I have tried different ways to achieve this, but with no success.

Mock data:

df = matrix(runif(1), nrow = 50000, ncol = 50000, dimnames = list(seq(1, 50000, by = 1), seq(1, 50000, by = 1)))

Trying to extract the upper/lower triangle of the correlation matrix and then reshape it:

df[lower.tri(df, diag = T),] = NA
df = reshape2::melt(df, na.rm = T)

crashes with:

Error in df[lower.tri(df, diag = T), ] = NA : 
  long vectors not supported yet: ../../src/include/Rinlinedfuns.h:522

It crashes with the same error if you only do df = df[lower.tri(df, diag = T),]. (I did read through "Large Matrices in R: long vectors not supported yet", but I didn't find it helpful for my situation.)

I also tried:

df = subset(as.data.frame(as.table(df)),
       match(Var1, names(annotation_table)) > match(Var2, names(annotation_table)))

to use only base R, but it eventually ran out of memory after ~1 day. The most RAM-intensive part is as.data.frame(as.table(df)), so I also tried replacing it with reshape2::melt(df), but that ran out of RAM as well.
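
For context, as.data.frame(as.table(df)) expands the whole matrix into long format before subset() filters anything; a tiny illustration of what it builds:

m <- matrix(1:9, 3, 3, dimnames = list(letters[1:3], letters[1:3]))
as.data.frame(as.table(m))   # 9 rows (Var1, Var2, Freq): every cell, both triangles and the diagonal

For a 50,000 x 50,000 matrix that is 2.5 billion rows: ~20 GB of values plus ~10 GB of integer codes per factor column, all materialized before the subset condition is even evaluated.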

I am running the code on an Ubuntu machine with 128GB RAM. I do have larger machines, but I would have expected this amount of RAM to suffice.

Any help would be highly appreciated. Thank you.

Marius
    "only" ~20GB RAM – Mark Jul 20 '23 at 12:24
  • I know... pathetic :| – Marius Jul 20 '23 at 12:32
  • You shouldn't have a comma before the closing bracket in `df[lower.tri(df, diag = T),] = NA`. Will using `df[lower.tri(df, diag = T)] = NA` help? – Robert Hacken Jul 20 '23 at 12:52
  • Hi Robert, thanks for the comment. Well... it was a glorious facepalm moment, followed by a "damn it" moment. This worked: `df[lower.tri(df, diag = T)] = NA`, but then `df = reshape2::melt(df, na.rm = T)` failed with `Error in if (n > 0) c(NA_integer_, -n) else integer() : missing value where TRUE/FALSE needed` – Marius Jul 20 '23 at 13:22
  • If `x` is the `n`-by-`n` matrix, then you can also try `x[sequence(seq_len(n), seq.int(1L, length(x), n))]`, to get a length-`n*(n+1)/2` vector containing the upper triangular elements in column-major order. – Mikael Jagan Jul 20 '23 at 13:23
  • Hi Mikael, thank you for your suggestion. I am trying to implement your approach, but `sequence(seq_len(50000), seq.int(1L, length(2500000000), 50000))` fails with `Error in sequence(seq_len(50000), seq.int(1L, length(2.5e+09), 50000)) : unused argument (seq.int(1, length(2.5e+09), 50000))` and honestly I don't really understand why – Marius Jul 20 '23 at 13:49
  • *Update*: the code in the above comment failed under R 3.6.3, but it works with R >= 4.0 as @Mikael pointed out – Marius Jul 20 '23 at 15:14
  • Why do you want to store it in a data frame to start with? Why not store it as a vector? For an object as huge as this, you do not need a data frame. Some other structures, e.g. `dist`, would use less memory; a dist object takes approximately 4 times less memory than the matrix itself. I am quite convinced that storing only the vector of correlations, i.e. the lower triangle, and the dimnames is sufficient. You do not need the various dimname combinations. In your case, you should only store a vector of length `50000 * (50000 - 1)/2` and a dimnames of length 50000 as an attribute. That should be enough. – Onyambu Jul 20 '23 at 15:26
  • Thanks for your comment. This is just the first step in a more complex analysis where objects are shared; a table or a data frame is the most compatible with the rest of the code – Marius Jul 20 '23 at 15:29
  • You could restructure your code. Always try to look for efficient ways (memory and speed) to write your code. Note that with the method I am proposing, you will store a vector of 5GB and then have a few functions to extract the element you need given any two combinations of the dimnames (a rough sketch of this idea follows after these comments). Why am I proposing this? Just because a data frame can hold the data, it does not necessarily mean it *should*. For example, just because one can write a for loop in R to find the maximum, it does not mean one *should*. You can easily store the vector and still keep the code compatible with the rest of it – Onyambu Jul 20 '23 at 17:04
  • If only it were that easy to restructure all the code... I guess you are referring to cases where you are working on a small script all by yourself, not having to work in large collaborations with legacy code tied to certifications and other dependent systems. Thanks for the theory, though. Maybe a concrete example from your side, with a bit of RAM/CPU/time benchmarking, would be more helpful – Marius Jul 20 '23 at 18:01
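
A rough sketch of that vector-plus-dimnames idea (an illustration under the assumptions of the comments, not Onyambu's actual code; `x` stands for the correlation matrix and the helper name `get_cor` is made up):

## keep only the unique correlations (lower triangle, no diagonal) plus the row names
v   <- x[lower.tri(x)]        # length n*(n-1)/2
nms <- rownames(x)

## hypothetical helper: look up the correlation for a pair of names using
## column-major arithmetic instead of storing every name combination
get_cor <- function(a, b, v, nms) {
  n <- as.double(length(nms))                 # double to avoid integer overflow for large n
  i <- match(a, nms); j <- match(b, nms)
  if (i == j) return(1)                       # diagonal of a correlation matrix
  if (i < j) { tmp <- i; i <- j; j <- tmp }   # ensure (i, j) lies in the lower triangle
  v[(j - 1) * n - j * (j - 1) / 2 + (i - j)]
}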

2 Answers


If you have as much RAM as you say, then this really should work without issue for n much greater than 6.

set.seed(0)
n <- 6L
x <- provideDimnames(cor(matrix(rnorm(as.double(n) * n), n, n)))
x
            A           B            C           D           E            F
A  1.00000000  0.42679900  0.113100027 -0.03952030 -0.02406114 -0.693427730
B  0.42679900  1.00000000  0.519377903  0.06136646 -0.51713799 -0.331961466
C  0.11310003  0.51937790  1.000000000 -0.43996491 -0.32225557 -0.006199606
D -0.03952030  0.06136646 -0.439964909  1.00000000 -0.42053358  0.537301520
E -0.02406114 -0.51713799 -0.322255571 -0.42053358  1.00000000 -0.367595700
F -0.69342773 -0.33196147 -0.006199606  0.53730152 -0.36759570  1.000000000
s <- seq_len(n) - 1L
nms <- dimnames(x)
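## s[j] = j - 1 counts the strictly upper-triangular cells in column j;
## sequence(s, seq.int(1L, length(x), n)) lists their column-major positions
## (the first j - 1 cells of each column), so no n-by-n logical mask is ever built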
dat <- data.frame(val = x[sequence(s, seq.int(1L, length(x), n))],
                  row = gl(n, 1L, labels = nms[[1L]])[sequence(s, 1L)], 
                  col = rep.int(gl(n, 1L, labels = nms[[2L]]), s))
dat
            val row col
1   0.426798998   A   B
2   0.113100027   A   C
3   0.519377903   B   C
4  -0.039520302   A   D
5   0.061366463   B   D
6  -0.439964909   C   D
7  -0.024061141   A   E
8  -0.517137993   B   E
9  -0.322255571   C   E
10 -0.420533577   D   E
11 -0.693427730   A   F
12 -0.331961466   B   F
13 -0.006199606   C   F
14  0.537301520   D   F
15 -0.367595700   E   F

If you are using a version of R older than 4.0.0, where sequence is defined differently, then you'll want something like:

sequence <- function(nvec, from = 1L, by = 1L)
    unlist(.mapply(seq.int, list(from = from, by = by, length.out = nvec), NULL),
           recursive = FALSE, use.names = FALSE)
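
A quick check that this backport reproduces the newer behaviour (a small example of my own):

sequence(c(2L, 3L), from = c(1L, 10L))
[1]  1  2 10 11 12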

Note that seq.int accepts both integer and double arguments and conveniently returns a double rather than an integer if it determines that the result would overflow integer, which happens in this case when n * n > .Machine$integer.max.

M <- .Machine$integer.max
typeof(seq.int(from = 1L, to = 1 + M, by = M))
[1] "double"
Mikael Jagan
  • @Marius AFAICT the only call where integer overflow is possible is `sequence(s, seq.int(1L, length(x), n))`. As long as `sequence` does not try to coerce its second argument (a `double` if `length(x) > .Machine$integer.max`) to `integer`, then you should be OK even without changing `seq.int->seq`, `1L->1`, etc. – Mikael Jagan Jul 21 '23 at 15:05
  • I see. Probably/hopefully you're right. Unfortunately, I don't have time to test it these coming days. If not, there's a working version with `double` everywhere in my answer. Thanks for updating your answer and for all the help @Mikael. I'll select your answer as *the* one – Marius Jul 24 '23 at 07:00

Okay, after digging a bit more and trying other things, I found in a previous post one solution that eventually worked:

upper_ind = which(upper.tri(df, diag=F), arr.ind = T)
gc() # clean up and free some RAM
df = data.frame(first = dimnames(df)[[2]][upper_ind[,2]],
                second = dimnames(df)[[1]][upper_ind[,1]],
                correlation = df[upper_ind])

For reference, testing it out on my real data (49,105 x 49,105 correlation matrix):

  • retrieving only the indices of the elements within the upper triangle of the correlation matrix (first command) peaked at ~90GB RAM
  • putting together the data frame (second command) peaked at ~100GB RAM

If you have such large datasets, I really suggest calling the garbage collector between the two commands, as it actually helped. Time-wise, it took less than 10 minutes. It is not ideal, but given my setup and time constraints, it is a solution.

Thank you @Robert Hacken for spotting the erroneous comma in df[lower.tri(df, diag = T),] = NA (i.e., the comma before the closing bracket should be removed).

I think that what @Mikael Jagan has proposed might be more memory efficient.

Update

With the updates from @Mikael to his answer and a bit of tweaking, his approach also works on very large datasets. The tweaking is mainly to use double instead of integer, as the data I work with exceeds the range of an integer (e.g., the size of my data is `length(x)` = 2,411,301,025, while the integer limit is `.Machine$integer.max` = 2,147,483,647). Of course, his intention was to speed it up as much as possible, hence using integers instead of doubles, but for my use case this is not possible.
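
A quick check of why the linear indices no longer fit into integers (the 49,105 is the matrix dimension mentioned above):

49105^2                          # 2,411,301,025 cells in the full matrix
.Machine$integer.max             # 2,147,483,647
49105^2 > .Machine$integer.max   # TRUE, so linear indices must be stored as doubles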

So modifying @Mikael's code to:

generate_sequence = function(nvec, from = 1, by = 1)
  unlist(.mapply(seq, list(from = from, by = by, length.out = nvec), NULL),
         recursive = FALSE, use.names = FALSE)

and the rest to:

set.seed(0)
n <- 6
x <- provideDimnames(cor(matrix(rnorm(n * n), n, n)))
s <- seq_len(n) - 1
nms <- dimnames(x)
dat <- data.frame(val = x[generate_sequence(s, seq(1, length(x), n))],
                  row = gl(n, 1, labels = nms[[1]])[generate_sequence(s, 1)],
                  col = rep(gl(n, 1, labels = nms[[2]]), s))

will get the job done on my data in ~2 minutes, peaking at ~70GB RAM. Thanks again for your answer, @Mikael.

If you run into trouble with the sequence generation, please read through the comments as well.

Marius