I have a very large CSV file of similarities between keywords: about 91 million rows (so a for loop takes too long in R) covering about 50,000 unique keywords. When I read it into a data.frame, it looks like:

> df   
kwd1 kwd2 similarity  
a  b  1  
b  a  1  
c  a  2  
a  c  2 

It is a sparse list and I can convert it into a sparse matrix using sparseMatrix():

> myMatrix 
  a b c  
a . 1 2
b 1 . .
c 2 . .
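
For reference, a minimal sketch of how a matrix like this could be built from the pair list. It assumes the Matrix package and the column names shown above; the helper vectors `kwds`, `i`, and `j` are illustrative names, not part of the original code.

```r
library(Matrix)

# Toy data matching the example in the question
df <- data.frame(
  kwd1 = c("a", "b", "c", "a"),
  kwd2 = c("b", "a", "a", "c"),
  similarity = c(1, 1, 2, 2),
  stringsAsFactors = FALSE
)

# Map both keyword columns onto one shared set of integer indices
kwds <- sort(unique(c(df$kwd1, df$kwd2)))
i <- match(df$kwd1, kwds)
j <- match(df$kwd2, kwds)

# Only the pairs present in df are stored; everything else stays implicit
myMatrix <- sparseMatrix(i = i, j = j, x = df$similarity,
                         dims = c(length(kwds), length(kwds)),
                         dimnames = list(kwds, kwds))
myMatrix
```

Using one shared keyword vector for both columns keeps the row and column order identical, which matters if the matrix is later treated as symmetric.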

However, I would now like to convert this into a dist object. I tried as.dist(myMatrix), but got an error saying the 'problem was too large' for as.dist(). I also tried converting the sparse matrix to a lower triangular sparse matrix first and then to a dist object (thinking this might be better), using myMatrix = myMatrix * lower.tri(myMatrix), but I got the same error, this time from the lower.tri function.

Thanks for any help!

rfoley
  • The first link's example does not work for a list of sparse distances. I do not have the pairwise distances for all combinations of keywords. – rfoley Sep 11 '12 at 23:36
  • I don't think that these questions had viable solutions, but I will look at them again. – rfoley Sep 11 '12 at 23:38
  • The issues here are very different to the ones linked to in other comments. Following them will get the OP nowhere. – Gavin Simpson Sep 12 '12 at 08:04
  • @mnel I've looked at the first link you suggested. However, I don't think that either of the answers to that question will work because I do not have a similarity, or distance, between each keyword. – rfoley Sep 12 '12 at 16:40

1 Answer

An object of class "dist" is a dense object. Going from the sparse representation to a dense one would require a vector on the order of

R> 0.5*(91000000*90999999)
[1] 4.1405e+15

elements (give or take for the diagonal). In R, the maximum length of a vector is 2^31 - 1:

R> 2^31 - 1
[1] 2147483647

which is far smaller than the number of elements you need to store the dense "dist" object. So it won't be possible, and that is the reason for the error from as.dist(). For similar reasons you won't be able to store the lower triangle version of the data as a dense object, as it too is held as a vector with the same length limit.
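
To make the density point concrete, here is a small sketch on the 3 x 3 toy matrix from the question. It assumes the missing b-c entry is written as an explicit 0, which is only a placeholder chosen for illustration; the question does not say what value a missing pair should take.

```r
# Dense version of the toy matrix from the question.
# The b-c pair is absent from the data; 0 is an assumed placeholder.
m <- matrix(c(0, 1, 2,
              1, 0, 0,
              2, 0, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("a", "b", "c"), c("a", "b", "c")))

d <- as.dist(m)   # keeps the lower triangle as a plain numeric vector
d
length(d)         # one value per pair of objects: 3 * (3 - 1) / 2
```

The resulting object holds a value for every pair, including b-c, so nothing about it is sparse.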

At this point I think you'll need to explain more about the actual problem and what you want the dissimilarity object for (in another question!). Do you need all dissimilarities between the 91 million objects, or could you get by with a sample that will fit into the current length limitations for R's vectors?

Gavin Simpson
  • Thanks for the response! Sorry if I wasn't clear in the post, but I have about 50,000 keywords, or objects, with about 91,000,000 pairwise similarities - not 91 million objects. Do you think that R could handle this case then? – rfoley Sep 12 '12 at 16:44
  • Also, the dist object should be okay for my dataset because it leaves out distances between objects that are not present. – rfoley Sep 12 '12 at 17:09
  • I am attempting to use fastcluster on my data to do some hierarchical clustering. – rfoley Sep 12 '12 at 17:15
  • I think you misunderstand what a dist object does. It is a vector containing the upper or lower triangle elements of a full, dense, dissimilarity matrix. In your `myMatrix` sparse matrix, you don't have the distance for b-c (or c-b), but a dist object will require that information (for one of the pair, as it is just storing one triangle of the matrix). The user manual for fastcluster mentions dist being a condensed representation, but it is only condensed in the sense that it stores just one of each pair. It is not sparse! – Gavin Simpson Sep 12 '12 at 17:24
  • Per your suggestion, I created another question more along the lines of my ultimate goal: http://stackoverflow.com/questions/12393466/hierarchical-clustering-large-sparse-distance-matrix-r – rfoley Sep 12 '12 at 17:49
  • I think the dist object should work with 50,000 objects given that it should be the size 50000(50000-1)/2 = 1249975000 which is within the limit you mentioned for vectors in R. I tried using the with statement you suggested for the other post, but with my example dataset it produces the wrong distances. Maybe I am doing it wrong? – rfoley Sep 12 '12 at 18:04
  • Ok I have an idea, if I can convert my sparse matrix to a data frame with all the pairwise distances, then I can create a dist object with with() as you suggested in the other question. Here is my new posting: http://stackoverflow.com/questions/12394588/large-sparse-matrix-of-distances-to-dataframe-in-r – rfoley Sep 12 '12 at 19:07
  • I think you have the wrong impression about the dimensions of the data or am I mistaken? – rfoley Sep 12 '12 at 20:39
  • No, I don't think so. Consider the dense matrix version of myMatrix. dist objects store the entire lower triangle of the dense version of myMatrix, even the off-diagonal entries that myMatrix represents by a `.`. If you can form the dense lower triangle of myMatrix then creating a dist object might work, but your comments here and elsewhere suggest you can't even do that. The key is that a dist object is dense and R has limits on the size of vectors. – Gavin Simpson Sep 12 '12 at 22:45
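
The sizes argued over in these comments can be checked with a few lines of R. This is only a sketch: `dist_length` is a hypothetical helper name, and the memory figure assumes 8 bytes per stored double and ignores any temporary copies made while building the object.

```r
# Number of elements a "dist" object needs for n objects
dist_length <- function(n) n * (n - 1) / 2

n_kwd <- 50000                   # unique keywords, per the question
len <- dist_length(n_kwd)        # 1249975000
len <= .Machine$integer.max      # TRUE: below 2^31 - 1
len * 8 / 1e9                    # roughly 10 GB of doubles

# Treating the 91 million rows as objects instead gives a very different answer
dist_length(91e6) <= .Machine$integer.max   # FALSE
```

So for 50,000 keywords the element count is under the vector length limit quoted in the answer, but every one of those pairs still has to be given a value and held in memory at once.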