-1

I have created an euclidean distance matrix using dist() function in R. Below is my R script. As the dimensions of matrix would be 16809 * 16809 while running this script in R I got the error message:

Error: cannot allocate vector of size 1.1 Gb

So is there any way to get rid of this error?

I haven't used parallelization in R previously. Can it be done using parallelization?

rnd.points = matrix(runif(3 * 16809), ncol = 3)
rnd.points <- rnd.points[1:5,]
ds <- dist(rnd.points)
as.matrix(ds) -> nt
nt
user3666197
  • 1
  • 6
  • 50
  • 92
seema aswani
  • 177
  • 1
  • 14
  • A square matrix requires n-squared memory. What can anyone / anything do? :) – Gopala Jul 08 '16 at 19:25
  • 1
    Parallelism won't really help if you need to bring all those distance values to one place and store them. – Gopala Jul 08 '16 at 19:27
  • Read this: http://stackoverflow.com/questions/13456639/efficient-memory-wise-function-for-repeated-distance-matrix-calculations-and-c – Boudewijn Aasman Jul 08 '16 at 19:53
  • perhaps of interest http://stackoverflow.com/questions/16190214/dist-function-with-large-number-of-points – user20650 Jul 08 '16 at 19:53
  • Possible duplicate of [Error writing large matrix using R ff](http://stackoverflow.com/questions/30966925/error-writing-large-matrix-using-r-ff) – seema aswani Jul 11 '16 at 05:05

2 Answers2

1

As @Gopola said: dist(.) computes all pairwise distances, and hence needs O(n^2) memory. Indeed, dist() is efficient and only stores half of the symmetric n x n matrix.

If I compute dist() on a computer with enough RAM, it works nicely, and indeed creates an object ds of size 1.1 Gb ... which is not so large for today's computers.

rnd.points <- matrix(runif(3 * 16809), ncol = 3)
ds <- dist(rnd.points)
object.size(ds)

Note however that your

as.matrix(ds) -> nt

is not such a good idea as the resulting matrix nt is indeed (almost) twice the size of ds, as nt is of course a n x n matrix.

Martin Mächler
  • 4,619
  • 27
  • 27
  • I want to create a network from the distance matrix. So for creating a network i want a square matrix that should be my adjacency matrix and based on that matrix i will create a network. – seema aswani Jul 09 '16 at 11:38
0

O/S has a principal limit on RAM-addressing ( smaller for a 32-bit system, larger for 64-bit system )
O/S next has a design-based limit for a max RAM a process can allocate ( +kill-s afterwards )


Had the same InRAM constraints in python and went beyond that

Sure, at some cost, but was a worth piece of experience.

python numpy has a wonderfull feature for this very scenario seamlessly inbuilt - a .memmap(). The word seamlessly is intentionally emphasised, as this is of the core importance for your problem re-formulation / re-design costs. There are tools available, but it will be your time to master 'em and to re-design your algoritm ( libraries et al ) so as these can use the new tools - guess what - SEAMLESSLY. This is the hidden part of the iceberg.


Handy R tools available:

  • filebacked.big.matrix which also supports an HPC cluster-wide sharing for distributed processing ( thus solving both PSPACE and PTIME dimensions of the HPC processing challenge, unless you fortunately hit the filesystem fileSize ceiling )

  • ff which allows
    library(ff)
    pt_coords <- ff( vmode = "double", dim = c(16809, 3), initdata = 0 )
    pt_dists <- ff( vmode = "double", dim = c(16809, 16809), initdata = -1 )
    and work with it in as simple as in matrix-alike [row,column] mode to fill in the points and process their pair-wise distances et al,
    ?ffsave for further details on saving your resulting distances data

and last, but not least

  • mmap + indexing

Parallel? No.
Distributed?
Yes, might help with PTIME:

As noted with filebacked.big.matrix there are chances to segment the computational PSPACE into smaller segments for distributed processing and reduction of the PTIME, but the concept is in principle just a concurrent (re)-use of available resouces, not the [ PARALLEL ] system-behaviour ( while it is necessary to admit, that lot of marketing ( the bad news is that even the technology marketing has joined this unfair and knowingly incorrect practice ) texts mis-uses the word parallel / parallelism in places, where a just concurrent system-behaviour is observed ( there are not many real, true-PARALLEL, systems ) ).


Conclusion:

Big matrices are doable in R well beyond the InRAM limits, select the tools most suitable for your problem-domain and harness all the HPC-resources you may.

Error: cannot allocate vector of size 1.1 Gb is solved.

There is nothing but resources, that imposts limits and delays on our computing-ready tasks, so do not hesitate to make your move while computing resources are still available for your Project, otherwise you will find yourself, with all the re-engineered software ready, but waiting in a queue for the computing resources.

Community
  • 1
  • 1
user3666197
  • 1
  • 6
  • 50
  • 92
  • Thank you for sharing your knowledge. What is ff here is it an inbuilt function in R..? – seema aswani Jul 09 '16 at 19:14
  • I want to find a distance matrix of uniformly distributed 16809 points in 3D space. So how can i achieve it using big.matrix. Can you provide an example R code what you have explained. – seema aswani Jul 09 '16 at 19:26
  • **`library(ff)`** is the answer for the first comment and updated the text, check the library documentation, both for `ff` and `filebacked.big.matrix` for the libraries syntax details. Euclidean distance in R^3? Elementary. – user3666197 Jul 09 '16 at 20:50
  • Yes euclidean distance in R^3. How to associate distance matrix and feed its points in to ff matrix.? – seema aswani Jul 10 '16 at 05:36
  • Dear @seemaaswani StackOverflow is not a do-my-homework site. **Your indicated MCVE error was solved** to a reasonable level of details, incl. alternative ways to direct your further Project development. **StackOverflow does not consider a practice of "give me a code" as a fair Community relation.** Develop your further efforts in directions necessary to extend your achievents ( ref. also to your duplicate attempts to ask for a code for the Euclidean distance in other post >>> http://stackoverflow.com/questions/37549154/code-to-generate-3d-random-geometric-graph-in-r ) – user3666197 Jul 10 '16 at 06:54
  • OK. The question i asked was not to feed total solution for my problem but a solution which i can find for my problem.. The link you shared in above comment was the attempt i have provided previously. In the post it self i have asked about whether its right way of doing it or not. As i am not that strong in coding. As the stack overflow community has wide number of experts my questions are always aimed to have some kind of suggestion or solution not for do my homework or feed me a full code. – seema aswani Jul 10 '16 at 07:20