
I have a ~20,000 x 20,000 data.table. How do I convert it from a data.table to a matrix efficiently, in terms of both speed and memory?

I tried m = as.matrix(dt), but it takes very long and produces many warnings. df = data.frame(dt) also takes very long and ends up hitting the memory limit.

Is there an efficient way to do this? Or is there simply a function in data.table that returns dt in matrix form (as required to feed it into a statistical model using the glmnet package)?

Simply wrapping it in as.matrix gives me the error below:

    x = as.matrix(dt)

    Error: cannot allocate vector of size 2.9 Gb
    In addition: Warning messages:
    1: In unlist(X, recursive = FALSE, use.names = FALSE) : Reached total allocation of 8131Mb: see help(memory.size)
    2: In unlist(X, recursive = FALSE, use.names = FALSE) : Reached total allocation of 8131Mb: see help(memory.size)
    3: In unlist(X, recursive = FALSE, use.names = FALSE) : Reached total allocation of 8131Mb: see help(memory.size)
    4: In unlist(X, recursive = FALSE, use.names = FALSE) : Reached total allocation of 8131Mb: see help(memory.size)

My OS: 64-bit Windows 7 with 8 GB of RAM. Windows Task Manager has shown Rgui.exe using more than 4 GB before, and that still ran fine.

zx8754
Gibson Gay
  • Please provide a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). – Joshua Ulrich Oct 02 '12 at 14:11
  • can you give a taste of what your data looks like (use `dput(subsetofyourdata)`)? What were the warnings you saw when you tried `as.matrix`? – Justin Oct 02 '12 at 14:11
  • can you put the structure of your table in the question? – add-semi-colons Oct 02 '12 at 14:12
  • I'd double check that your computer isn't going to explode by feeding it a 20k by 20k matrix first... methinks that will likely be the case if you don't have the memory hanging around to convert with `as.matrix()`. You can give this a whirl with some random data like so: `matrix(runif(20000*20000),ncol = 20000)`. On my machine, this takes up about 3GB worth of space, so it is not a svelte chunk of data by any means. – Chase Oct 02 '12 at 14:47
  • @Justin > x=as.matrix(dt) Error: cannot allocate vector of size 2.9 Gb In addition: Warning messages: 1: In unlist(X, recursive = FALSE, use.names = FALSE) : Reached total allocation of 8131Mb: see help(memory.size) 2: In unlist(X, recursive = FALSE, use.names = FALSE) : Reached total allocation of 8131Mb: see help(memory.size) 3: In unlist(X, recursive = FALSE, use.names = FALSE) : Reached total allocation of 8131Mb: see help(memory.size) 4: In unlist(X, recursive = FALSE, use.names = FALSE) : Reached total allocation of 8131Mb: see help(memory.size) – Gibson Gay Oct 02 '12 at 15:17
  • @Null-Hypothesis dt contains 1 character column (key) and integers for the rest. – Gibson Gay Oct 02 '12 at 15:20
  • @Chase I have 64-bit Windows 7 and 8 GB of RAM. My Windows Task Manager has shown Rgui.exe using more than 4 GB before, and that still ran fine. – Gibson Gay Oct 02 '12 at 15:21
  • @GibsonGay - the error message above indicates you're running out of memory. A general rule of thumb for memory management is that you need about 3x the size of any given object available as memory to operate on it. `data.table()` relaxes some of those criteria due to its awesomeness, but you're trying to go away from `data.table()`. In my experience, the modeling functions require more than 3x memory on occasion, but I have no experience with `glmnet()`. Unless you have another stick of memory hanging around, I think you're better off figuring out Amazon EC2 and launching this in the cloud. – Chase Oct 02 '12 at 15:28
  • See [here](http://www.bioconductor.org/help/bioconductor-cloud-ami/) for making Amazon EC2 trivially easy to use. – Chase Oct 02 '12 at 15:29
  • @Chase Thanks a lot, I agree data.table is super awesome. I made an error on my part by including the character column in the matrix, which elevated the matrix's class to character for all columns. Removing this column allowed an integer matrix to be made; it converted successfully without errors/warnings and the model ran fine. :) Thank you for all your help though, I will certainly keep Amazon EC2 in mind! – Gibson Gay Oct 02 '12 at 15:55
  • @GibsonGay Thanks for the update. I was starting to worry there for a second. Could you self answer, and self accept, to wrap up please. – Matt Dowle Oct 02 '12 at 16:18
  • @GibsonGay Self-answer neatly please... – Erdogan CEVHER Dec 18 '17 at 11:57
  • @MattDowle moved the OP's resolution to wiki answer. OP is offline since 2012. – zx8754 Nov 22 '18 at 12:49
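
The memory figures discussed in the comments can be checked with quick arithmetic. The sketch below assumes a 64-bit R session, where an integer cell takes 4 bytes, a double takes 8 bytes, and each cell of a character matrix holds an 8-byte pointer (the strings themselves are stored separately). The exact 20,000 x 20,000 shape is also an assumption, since the question only says "~20,000".

```r
# Rough size of a dense n x n matrix for each storage type (64-bit R assumed).
n <- 20000
cells <- n * n

# Bytes per cell; for character this counts only the pointer, not the strings.
bytes_per_cell <- c(integer = 4, double = 8, character = 8)

# Size in GiB, ignoring the small fixed object header.
gib <- cells * bytes_per_cell / 1024^3
print(round(gib, 2))
```

Under these assumptions an integer matrix needs about half the memory of a double or character matrix, which is in the same range as the roughly 3 GB figure and the 2.9 Gb allocation mentioned above.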

2 Answers


Try:

    result <- as.matrix(tidytext::cast_sparse(dat_table,
                                              column_name_of_rows,
                                              column_name_of_columns,
                                              column_name_of_values))

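It should be efficient and fast, provided the table is in long format (one row per row/column/value combination), which is what cast_sparse expects. Note that the outer as.matrix() converts the sparse result back to a dense matrix; glmnet also accepts sparse matrices directly.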
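
As a minimal sketch of how this call behaves, the toy data below (a hypothetical `long` data frame with made-up column names `row_id`, `col_id`, and `value`) is in long format, with one row per row/column/value combination. It assumes the tidytext package is installed.

```r
library(tidytext)

# Hypothetical long-format data: one row per (row, column, value) triple.
long <- data.frame(
  row_id = c("a", "a", "b"),
  col_id = c("x", "y", "x"),
  value  = c(1L, 2L, 3L)
)

# Build a sparse matrix; combinations that are absent are stored as zero.
sp <- cast_sparse(long, row_id, col_id, value)

# Densify only if the downstream function really needs a base matrix.
dense <- as.matrix(sp)
print(dense)
```

For a wide table like the one in the question, this route would first require reshaping to long format, so it may not save memory there.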

P. Denelle

@GibsonGay:

I made an error on my part by including the character column in the matrix, which elevated the matrix's class to character for all columns. Removing this column allowed an integer matrix to be made; it converted successfully without errors/warnings and the model ran fine.
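
A minimal sketch of that fix on a small stand-in table follows. The table `dt` and its column names (`id`, `x1`, `x2`) are hypothetical, since the original data was never posted; only the layout (one character key column, integers elsewhere) comes from the comments.

```r
library(data.table)

# Hypothetical stand-in for the real table: one character key, integer columns.
dt <- data.table(id = c("a", "b", "c"), x1 = 1:3, x2 = 4:6)
setkey(dt, id)

# Converting everything promotes the whole matrix to character.
m_chr <- as.matrix(dt)
print(typeof(m_chr))

# Select only the non-key columns before converting.
value_cols <- setdiff(names(dt), "id")
m_int <- as.matrix(dt[, value_cols, with = FALSE])

# Optionally keep the key as row names instead of as a column.
rownames(m_int) <- dt$id
print(typeof(m_int))
```

The integer matrix can then be passed on to the model, with the key kept only as row names.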

zx8754