I am having trouble with the vegdist function from the vegan package. I want to calculate a Jaccard distance matrix from binary (presence/absence) data.

The problem is that I have a matrix of 138037 rows (sites) and 89 columns (species). My script is:

library("vegan")
memory.limit(size = 100000) # Windows only; size is in MB, so this requests about 100 GB (not 1 TB), allowing R to spill to disk once RAM is exhausted
DF=as.data.frame(MODELOS)
DF=na.omit(DF)
DISTAN=vegdist(DF[,2:ncol(DF)],"jaccard")

Or more reproducibly:

nsites <- 138037
nspp <- 89
DF <- matrix(0,nrow=nsites,ncol=nspp)
DISTAN=vegdist(DF,"jaccard")

Almost immediately it produces the error:

Error in double(N * (N - 1)/2) : vector size specified is too large

I think this is a memory error, but I don't know why it happens, since my PC has 32 GB of RAM and a 1 TB hard drive.

I also tried to build the distance matrix with the dist function from the proxy package:

library(proxy)
vector=dist(DF, method = "Jaccard")

It starts to run, but when it reaches about 10 GB of RAM, a window reports that R has encountered an error and will close; R then closes and starts a new session.

I really don't know what is going on, much less how to solve it. Can anybody help me?

Ben Bolker
  • Please edit your question and the title to be in English. Everything on this website should be written in English (that includes the error messages). – thaJeztah Feb 08 '13 at 23:13
  • Is there any particular reason why this is flagged PHP? – Mark Baker Feb 08 '13 at 23:16
  • @user2055974 you can set `Sys.setenv(LANG = "en")` in order to reproduce the error message in English. See [this post](http://stackoverflow.com/questions/13575180/how-to-change-the-language-of-errors-in-r) – Jilber Urbina Feb 08 '13 at 23:16
  • You shouldn't cross-post between StackOverflow and R-help: http://article.gmane.org/gmane.comp.lang.r.general/286593 (where Brian Ripley gave you pretty much the same answer as I did, below ...) – Ben Bolker Feb 09 '13 at 14:07

1 Answer

`N <- 138037; log10(N*(N-1)/2)` shows that you are trying to compute a dist object with 10^9.98, i.e. almost 10^10 (10 billion), distinct elements. The released version of R can only handle objects with at most 2^31-1 elements (`log10(2^31-1)` = 9.3), regardless of the amount of memory available. This restriction is relaxed in the development version of R (search for "LONG VECTORS"); see also Max Length for a Vector in R.

The bigger question, though, is: what do you actually plan to do with a distance matrix with 10 billion distinct elements? If you explain a little bit more about the context of what you're trying to do, you might get some more useful answers (i.e. not just "why is this happening?" but "what can I do about it?").

Without more context, all I can say is "try switching to the development version of R and see if that helps". It might not, though: long vectors are not supported in all aspects of R, and especially not in code that uses underlying C or FORTRAN sources.

I'm not sure why `proxy::dist` fails in a different way.
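
The size argument above can be checked directly in R. The sketch below only redoes the arithmetic from the question's `N`; the memory estimate assumes the distances are stored as doubles (8 bytes each), which is how R stores numeric vectors.

```r
N <- 138037                          # number of sites (rows) in the question

# A dist object stores only the lower triangle: N * (N - 1) / 2 values
n_pairs <- N * (N - 1) / 2
format(n_pairs, big.mark = ",")      # number of pairwise distances
log10(n_pairs)                       # order of magnitude, close to 10

# Largest vector length allowed without long-vector support: 2^31 - 1
.Machine$integer.max
n_pairs > .Machine$integer.max       # TRUE: too long for an ordinary vector

# Approximate storage for the distances alone, at 8 bytes per double
gb <- n_pairs * 8 / 1e9
gb
```

Even where long vectors are supported, the result would need roughly 76 GB just to hold the distances, which is more than the 32 GB of RAM mentioned in the question.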

Ben Bolker
  • Thanks for answering. I want to build a hierarchical clustering of 138037 pixels of 1 km^2 each from a study area in the Colombian Andes. I have distribution models for 89 species, so my matrix has the pixels in the rows and the species in the columns, filled with the absence (0) / presence (1) of each species in each pixel. I think the bigger problem is that the agglomeration method in hierarchical clustering needs the whole matrix, so I can't divide it. – user2055974 Feb 11 '13 at 15:55
  • Also, because the agglomeration method in hierarchical clustering needs the whole matrix, it can't be divided. I ran some tests on a smaller area and it works. The code: `MODELOS=stack(list.files(pattern="*.tif$")); DF=as.data.frame(MODELOS); DF=na.omit(DF); DISTAN=vegdist(DF[,2:ncol(DF)],"jaccard"); E1=hclust(DISTAN,"ward")` – user2055974 Feb 11 '13 at 16:23
  • I don't have time to dig into this right now, but I would suggest that you might ask for help on the `r-sig-ecology@r-project.org` mailing list. I would try to give the broad context for your question: it may be that there is a different technique that would answer your *biological* question effectively, with less computational burden. (Do you really need 1-km pixels, or does the scale of variation appear to be larger than that?) – Ben Bolker Feb 11 '13 at 17:19
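
One way to reduce the computational burden, not proposed anywhere in this thread and offered only as a sketch, is to collapse pixels with identical presence/absence patterns before computing distances. Whether the real data contain many duplicate rows is an assumption that would need checking. The example uses a small simulated matrix (`toy`, hypothetical) and base R's `dist(method = "binary")`, which gives the Jaccard distance for 0/1 data, so it runs without vegan.

```r
set.seed(1)

# Hypothetical stand-in for the real data: 5000 pixels, 8 species (0/1)
toy <- matrix(rbinom(5000 * 8, size = 1, prob = 0.3), ncol = 8)
toy <- toy[rowSums(toy) > 0, , drop = FALSE]   # drop empty pixels (Jaccard undefined)

# Keep one row per distinct presence/absence pattern
uniq <- unique(toy)

# Number of pairwise distances before and after collapsing duplicates
n_pairs <- function(n) n * (n - 1) / 2
n_pairs(nrow(toy))
n_pairs(nrow(uniq))

# Map every pixel back to its pattern, and count pixels per pattern
key_all  <- apply(toy,  1, paste, collapse = "")
key_uniq <- apply(uniq, 1, paste, collapse = "")
pattern_id <- match(key_all, key_uniq)
weights <- tabulate(pattern_id, nbins = nrow(uniq))

# Jaccard distances between the distinct patterns only
d <- dist(uniq, method = "binary")
```

Clustering the distinct patterns is not automatically equivalent to clustering all pixels, because each pattern stands for a different number of pixels. `hclust` has a `members` argument that may be relevant for passing `weights`; check its documentation before relying on it.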