I have a fairly large dataset (~50K entries) which I use to generate a correlation matrix. This works well, using "only" ~20GB RAM.
Then I want to extract only the unique pairwise combinations from it and convert them into a data frame. This is where I run into issues: either excessive RAM usage or overflow of the indexing variable(s). I know there are ~1.25 billion unique combinations (50000 choose 2), so I am aware the size explodes a bit, but still..
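For scale, here is a back-of-the-envelope estimate of just the final data frame, assuming two integer pair indices plus one double correlation value per row (the exact layout depends on how the reshape is done):

```r
n <- 50000
pairs <- n * (n - 1) / 2   # unique off-diagonal pairs: 1,249,975,000
# per row: 2 integer indices (4 bytes each) + 1 double (8 bytes)
bytes <- pairs * (4 + 4 + 8)
bytes / 2^30               # ~18.6 GiB for the result alone, before any intermediates
```

So the result itself should fit in 128GB; it is the intermediate copies that blow up.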
I have tried different ways to achieve this, but with no success.
Mock data:
df = matrix(runif(1),nrow=50000, ncol=50000, dimnames=list(seq(1,50000,by=1), seq(1,50000,by=1)))
Trying to extract upper/lower triangle from the correlation matrix and then reshape it:
df[lower.tri(df, diag = T),] = NA
df = reshape2::melt(df, na.rm = T)
crashes with:
Error in df[lower.tri(bla, diag = T), ] = NA :
long vectors not supported yet: ../../src/include/Rinlinedfuns.h:522
It crashes with the same error if I do only: df = df[lower.tri(df, diag = T),]
(I did read through "Large Matrices in R: long vectors not supported yet", but I didn't find it helpful for my situation.)
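At small scale the triangle-then-reshape idea itself works as expected; a base-R equivalent of the melt step is shown here for self-containment (note the single-bracket logical mask, without the trailing comma that triggers the row-subset form):

```r
m <- matrix(runif(16), nrow = 4, ncol = 4, dimnames = list(1:4, 1:4))
m[lower.tri(m, diag = TRUE)] <- NA           # single-bracket logical indexing, no trailing comma
res <- na.omit(as.data.frame(as.table(m)))   # base-R analogue of reshape2::melt(m, na.rm = TRUE)
nrow(res)                                    # 6 unique pairs for a 4x4 matrix
```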
I also tried:
df = subset(as.data.frame(as.table(df)),
match(Var1, names(annotation_table)) > match(Var2, names(annotation_table)))
to stick to base-R packages, but it eventually ran out of memory after ~1 day. The most RAM-intensive part is as.data.frame(as.table(df)),
so I also tried replacing it with reshape2::melt(df),
but that ran out of RAM as well.
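One direction that avoids long-vector subsetting entirely is to walk the matrix column by column, since each column only contributes pairs (i, j) with i < j and no single index vector ever exceeds the matrix dimension. A small-scale sketch (the helper name and the final rbind are my own; at 50K columns one would append each chunk to disk instead of rbind-ing in memory):

```r
# Column-wise extraction of the upper triangle into (Var1, Var2, value) rows.
extract_upper <- function(m) {
  n <- ncol(m)
  cols <- vector("list", n)
  for (j in 2:n) {
    i <- seq_len(j - 1)                       # rows strictly above the diagonal
    cols[[j]] <- data.frame(Var1 = rownames(m)[i],
                            Var2 = colnames(m)[j],
                            value = m[i, j])
  }
  do.call(rbind, cols)                        # for real sizes: write each chunk out instead
}

m <- matrix(runif(16), nrow = 4, dimnames = list(letters[1:4], letters[1:4]))
res <- extract_upper(m)
nrow(res)   # 6 pairs for a 4x4 matrix
```

I have not verified this at the full 50K scale, but it sidesteps the single giant logical mask.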
I am running the code on an Ubuntu machine with 128GB RAM. I do have larger machines, but I would've expected this amount of RAM to suffice.
Any help would be highly appreciated. Thank you.