
I'm at a total loss on this one. I have a large, though not unreasonable, data frame in R (48,000 x 19). I'm trying to use sm.ancova() to investigate the differential effect slopes, but I got:

error: cannot allocate vector of size 13.1GB

13 GB overtaxed the memory allocated to R, I get that. But... what?! The entire CSV file I read in was only 24,000 KB. Why are these single vectors so huge in R?

The ancova code I'm using is:

library(sm)                     # sm.ancova() comes from the sm package
data1 <- read.csv("data.csv")
attach(data1)                   # make the columns s, dt, dip visible by name
sm.ancova(s, dt, dip, model = "none")

Looking into it a bit, I used:

diag(s)
length(s)
diag(dt)
length(dt)
diag(dip)    
length(dip)

The diag() calls all gave the same error; the length() calls all returned 48000.

Any explanation would help. A fix would be better :)

Thanks in advance!

A dummy data link that reproduces this problem can be found at: https://www.dropbox.com/s/dxxofb3o620yaw3/stackexample.csv?dl=0

Jesse001
  • Possible dupe: [R memory management / cannot allocate vector of size n](http://stackoverflow.com/q/5171593/903061) – Gregor Thomas Oct 12 '16 at 18:19
  • `sm.ancova` is trying to allocate an object of great size. The code written by the author of the package is likely not as memory efficient as it could be. – Vlo Oct 12 '16 at 18:21
  • Gregor: not quite, that one is trying to find a workaround for legitimately oversized data. I'm trying to figure out why my vectors are getting so large (orders of magnitude larger than the original file), and how to prevent it. Similar, but a bit different. – Jesse001 Oct 12 '16 at 18:22
  • Vlo: do you have a recommendation for another non-parametric ancova wrapper? – Jesse001 Oct 12 '16 at 18:23
  • If you want a specific answer, you should provide a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) that we could troubleshoot to see what's going on. If you want another analysis recommendation, you should ask over at [stats.se] instead. – MrFlick Oct 12 '16 at 18:29
  • That's the extent of the problem. It's not a coding issue; the code all runs fine. The problem is that the data are getting too large. Short of posting the data (which I cannot legally do), I don't see a way to provide a more reproducible example. I assume any coding issue that exists is part of the wrapper function I'm using, sm.ancova, but I don't know for sure. I read in the data, attach it, and run the above script. The names in the code are column titles. – Jesse001 Oct 12 '16 at 18:31
  • @Jesse001 You could just share your whole data or make a dummy dataset that reproduces your trouble. – s_baldur Oct 12 '16 at 18:34
  • @Snoram I made a dummy set, but how do I load something that large on this website? – Jesse001 Oct 12 '16 at 18:41
  • I would use a link via Dropbox (Google Drive is one alternative amongst many) to the CSV. – s_baldur Oct 12 '16 at 18:51
  • I edited the question with a Dropbox link to the data, thanks for the suggestion – Jesse001 Oct 12 '16 at 18:58

1 Answer


Get data:

## CSV file is 10M on disk, so it's worth using a faster method
##   than read.csv() to import ...
data1 <- data.table::fread("stackexample.csv",data.table=FALSE)
dd <- data1[,c("s","dt","dip")]
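A quick sanity check on what got loaded (the row count and column names below follow from the dummy data set and the subsetting above):

nrow(dd)    ## 96000 -- rows in the dummy data set
names(dd)   ## "s"  "dt"  "dip"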

If you give diag() a vector, it's going to try to make a diagonal matrix with that vector on the diagonal. The example data set you linked is 96,000 rows long, so diag() applied to any of its columns will try to construct a 96,000 x 96,000 dense matrix. For scale, a 1000 x 1000 matrix of doubles is

format(object.size(diag(1000)),"Mb")  ## 7.6 Mb

so the matrix you're trying to construct here will be about 96^2 * 7.6 Mb / 1024 ≈ 68 Gb.

A 48K x 48K matrix (the size of your original data) would be 4 times smaller but still about 17 Gb ...
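To see the expansion on a small scale, and to check the arithmetic (a minimal sketch, assuming 8-byte doubles and the 1024-based units that object.size() reports):

## diag() on a vector of length n silently builds an n x n dense matrix:
diag(c(1.5, 2.5, 3.5))
##      [,1] [,2] [,3]
## [1,]  1.5  0.0  0.0
## [2,]  0.0  2.5  0.0
## [3,]  0.0  0.0  3.5

## back-of-the-envelope sizes for dense n x n matrices of doubles, in Gb:
96000^2 * 8 / 2^30   ## ~68.7  (the 96,000-row dummy data)
48000^2 * 8 / 2^30   ## ~17.2  (the original 48,000-row data)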

It is possible to use sparse matrices to construct big diagonal matrices:

library(Matrix)
object.size(Diagonal(x=1:96000))
## 769168 bytes
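The same idea applies directly to the columns of the example data, as a sketch (this assumes the dd data frame built above and that the s column is numeric, as sm.ancova requires):

D <- Diagonal(x = dd$s)   ## sparse diagonal Matrix; stores only the 96,000 diagonal values
object.size(D)            ## again under 1 Mb, versus ~68 Gb for diag(dd$s)

The catch is that this only helps if the downstream code accepts sparse Matrix objects; judging by the allocation error in the question, sm.ancova builds large dense objects internally, so a sparse input on its own won't make that call succeed.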

More generally, not all analysis programs are written with computational efficiency (either speed or memory) in mind. The papers on which this method is based (?sm.ancova) were written in the late 1990s, when a data set with tens of thousands of observations would have been considered huge ...

Ben Bolker