4

I currently want to calculate all-pairs document similarity using cosine similarity and TF-IDF features in Python. My basic approach is the following:

from sklearn.feature_extraction.text import TfidfVectorizer
# c = [doc1, doc2, ..., docn]
vec = TfidfVectorizer()
X = vec.fit_transform(c)  # sparse CSR matrix; rows are L2-normalized by default
del vec
Y = X * X.T               # pairwise cosine similarities

This works perfectly fine, but unfortunately not for my very large dataset. X has shape (350363, 2526183), and hence the output matrix Y should have shape (350363, 350363). X is very sparse due to the TF-IDF features and therefore easily fits into memory (only around 2 GB). Yet the multiplication gives me a memory error after running for some time (even though the memory is not full; I suppose scipy is clever enough to anticipate the memory usage).

I have already tried to play around with the dtypes, without success. I have also made sure that numpy and scipy have their BLAS libraries linked, although this has no effect on the csr_matrix dot functionality, since it is implemented in C. I thought of maybe using things like memmap, but I am not sure about that.

Does anyone have an idea of how to best approach this?

fsociety

3 Answers

7

Even though X is sparse, X * X.T probably won't be; notice that it only takes one common nonzero element in a given pair of rows. You are working on an NLP task, so I am pretty sure there are huge numbers of words which occur in nearly all documents (and, as said before, it does not have to be one word shared by all pairs, but one, possibly different, word for each pair). As a result you get a matrix of 350363^2, which is about 122,000,000,000 elements; if you don't have hundreds of GB of RAM, it does not look computable. Try to perform much more aggressive filtering of words in order to force X * X.T to be sparse (remove many common words).

In general you won't be able to compute the Gram matrix of big data unless you enforce sparsity of X * X.T, so that most of your pairs of vectors (documents) have 0 "similarity". It can be done in numerous ways; the easiest is to set some threshold T below which you treat <a,b> as 0, compute the dot products yourself, and create an entry in the resulting sparse matrix iff <a,b> > T.
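For illustration, a rough sketch of such a blockwise, thresholded computation (the threshold `T` and block size are arbitrary example values, and `X` is assumed to be the L2-normalized CSR matrix from the question, so the dot products are cosine similarities):

```python
import scipy.sparse as sp

def thresholded_gram(X, T=0.2, block_size=1000):
    """Compute X * X.T block by block, keeping only entries greater than T."""
    n_rows = X.shape[0]
    blocks = []
    for start in range(0, n_rows, block_size):
        stop = min(start + block_size, n_rows)
        block = X[start:stop].dot(X.T).tocoo()  # similarities for one block of rows
        keep = block.data > T                   # drop near-zero similarities
        blocks.append(sp.coo_matrix(
            (block.data[keep], (block.row[keep], block.col[keep])),
            shape=block.shape).tocsr())
    return sp.vstack(blocks).tocsr()

# Y_sparse = thresholded_gram(X, T=0.2)
```

The block size controls the peak memory taken by the intermediate, unthresholded block.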

lejlot
  • Yes, that's what I figured and also mentioned above. However, filtering words does not change anything, as the resulting matrix would still be `#documents * #documents`. – fsociety Aug 03 '14 at 12:53
  • You are wrong. If the resulting matrix (`X * X.T`, not `X` itself) is sparse (which is an effect of aggressive filtering), then it won't be `#documents * #documents` in terms of memory. – lejlot Aug 03 '14 at 12:54
  • Yeah, right, but I still have a high chance of at least one non-zero common element per pair. Anyhow, I am looking for solutions that make this work without much dimensionality reduction of the features. – fsociety Aug 03 '14 at 12:55
  • Of course you do, but you are not dealing with a problem of efficiency; you are trying to solve the unsolvable. You cannot have a "full Gram matrix of big data" unless you have really big RAM. You can either approximate it (as stated in the answer), buy more RAM, or compute the required dot products "on the fly" instead of as a matrix. You also seem to misunderstand what a "-1" vote on an answer means: it is not supposed to signal "I would prefer a different approach", but that the given answer is incorrect with respect to the stated problem. – lejlot Aug 03 '14 at 13:01
  • That was an accident, and I cannot revert it to zero, so here you have your upvote. Anyhow, you are not trying to provide an answer to the question as asked. It is definitely not unsolvable as you claim. I am looking for a way of approaching it and even gave an idea of using things like memmap; maybe PyTables would also work. Hoping for further answers. – fsociety Aug 03 '14 at 13:15
6

You may want to look at the random_projection module in scikit-learn. The Johnson-Lindenstrauss lemma says that a random projection matrix is guaranteed to preserve pairwise distances up to some tolerance eta, which is the hyperparameter you choose when calculating the number of random projections needed.

To cut a long story short, the scikit-learn class SparseRandomProjection is a transformer that does this for you. If you run it on X after vec.fit_transform, you should see a fairly large reduction in feature size.

The formula from sklearn.random_projection.johnson_lindenstrauss_min_dim shows that to preserve distances up to a 10% tolerance, you only need johnson_lindenstrauss_min_dim(350363, .1) = 10942 features. This is an upper bound, so you may be able to get away with much less. Even a 1% tolerance would only need johnson_lindenstrauss_min_dim(350363, .01) = 1028192 features, which is still significantly less than you have right now.
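As a quick sketch of how that looks in code (assuming `X` is the TF-IDF matrix from the question; note that scikit-learn's keyword for the tolerance, the eta above, is `eps`):

```python
from sklearn.random_projection import (SparseRandomProjection,
                                       johnson_lindenstrauss_min_dim)

# Upper bound on the number of components needed for ~10% distortion
n_components = johnson_lindenstrauss_min_dim(n_samples=350363, eps=0.1)
print(n_components)  # 10942, the number quoted above

# Project the sparse TF-IDF matrix down to that many features
srp = SparseRandomProjection(n_components=n_components, random_state=0)
X_small = srp.fit_transform(X)  # still sparse, shape (350363, 10942)
```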

EDIT: Simple thing to try - if your data is dtype='float64', try using 'float32'. That alone can save a massive amount of space, especially if you do not need double precision.
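For example, assuming `X` is the sparse TF-IDF matrix:

```python
import numpy as np

X = X.astype(np.float32)  # halves the memory used by the stored values
```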

If the issue is that you cannot store the "final matrix" in memory either, I would recommend working with the data in an HDF5Store (as seen in pandas using pytables). This link has some good starter code, and you could iteratively calculate chunks of your dot product and write to disk. I have been using this extensively in a recent project on a 45GB dataset, and could provide more help if you decide to go this route.
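A rough sketch of that chunked approach, using PyTables directly (file name, chunk size, and dtype are illustrative; be aware that a dense float32 matrix of shape (350363, 350363) still takes roughly 490 GB on disk):

```python
import numpy as np
import tables

n_rows = X.shape[0]  # X: the sparse TF-IDF matrix from the question
chunk = 1000         # rows per chunk; tune to fit memory

with tables.open_file("similarities.h5", mode="w") as h5:
    out = h5.create_carray(h5.root, "similarity",
                           tables.Float32Atom(), shape=(n_rows, n_rows))
    for start in range(0, n_rows, chunk):
        stop = min(start + chunk, n_rows)
        block = X[start:stop].dot(X.T)  # sparse block of the result
        out[start:stop, :] = block.toarray().astype(np.float32)
```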

Kyle Kastner
  • This looks great, I did not know about that. I am not sure whether it solves my problem though, as I still have to calculate the final `(350363, 350363)` matrix. Can you quickly explain what exactly the tolerance parameter states? – fsociety Aug 04 '14 at 09:44
  • The tolerance parameter is the "noise" in the preservation of pairwise distances - basically, the pairwise distances you get out of the reduced matrix are within ```eta``` of those of the original dataset - in this case 10% or 1% or whatever you choose. It was unclear from your post whether the final storage was causing the error, or the intermediate calculation. If it is the intermediate, then this will help. If you can't even store the final result, that is a whole separate issue. – Kyle Kastner Aug 04 '14 at 09:54
  • In the latter case I also have recommendations - will add an edit to my post! – Kyle Kastner Aug 04 '14 at 09:54
  • Yes, my feature matrix can be stored quite easily as it is really sparse; the bottleneck lies in the resulting matrix. I thought about using something like PyTables or even numpy's memmap as well, so maybe I should really go this route. However, it is not so trivial to do while working with sparse matrices. So maybe I should go two ways: (a) use the dimensionality reduction and work with a dense matrix afterwards, and (b) then use something like an HDF5Store. I already tried using `np.float32` btw. – fsociety Aug 04 '14 at 10:42
  • There is a nice answer for sparse data [here](http://stackoverflow.com/questions/11129429/storing-numpy-sparse-matrix-in-hdf5-pytables/22589030#22589030) which may be of help in your case, if you decide to work directly with sparse data. – Kyle Kastner Aug 04 '14 at 11:38
  • One more thing to consider is looking for a "kernelized" version of your classifier/regressor/clustering solution. You may be able to do the work in an "implicit space" by setting up a cosine similarity kernel, and doing your task (PCA/linear regression/k-means) like normal. – Kyle Kastner Aug 04 '14 at 11:47
  • Will look into it. Another question: how does the random projection work with L2-normalized features? I just tried `SparseRandomProjection` with the default `eta=.1` and then multiplied the matrix with its transpose on some sample rows. The self-distance is outside the 10% range though; e.g., `result[0,0]` should be `1.` but is `0.75`. – fsociety Aug 04 '14 at 11:49
  • Kernelized doesn't work for me, as I need the exact similarity scores between the documents for something else. – fsociety Aug 04 '14 at 11:52
  • You may need to "renormalize", as I am not sure that the random matrix has any guarantees for being normalized. You can check tf.components_ to see. There is a nice example [here](http://scikit-learn.org/stable/auto_examples/plot_johnson_lindenstrauss_bound.html#example-plot-johnson-lindenstrauss-bound-py) with more detail. If you are in need of the exact pairwise distances and not an approximation, I would recommend HDF5Store as a better (though more work) solution. – Kyle Kastner Aug 04 '14 at 12:08
  • Ok so I need to find a way to compute the dot of two sparse matrices with PyTables. – fsociety Aug 04 '14 at 12:44
  • I would begin by creating an HDF5Store of zeros on disk, then slice some part of each of your X, X.T matrices, do the dot, then slice into the correct part of the HDF5Store to assign the result. You can change the size of the slices to control your memory usage, and you can use symmetry to avoid recomputing. I feel like there should be builtin support for this in some library, but am unaware of any which do this automagically... – Kyle Kastner Aug 04 '14 at 12:51
  • This seems to be a good solution: http://stackoverflow.com/questions/19684575/matrix-multiplication-using-hdf5 – fsociety Aug 04 '14 at 13:01
1

What you could do is take one row of X at a time, multiply it by X.T, and save the resulting row of the output to a file. Then move on to the next row.

It is still the same amount of calculation work but you wouldn't run out of memory.

Using multiprocessing.Pool.map() or multiprocessing.Pool.map_async() you might be able to speed it up, provided you use numpy.memmap() to read the matrix in the mapped function. You would probably have to write each of the calculated rows to a separate file and merge them later. If you were to return the row from the mapped function, it would have to be transferred back to the original process, which would take a lot of memory and IPC bandwidth.
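A rough sketch of that strategy (block size and file names are made up; since numpy.memmap only covers dense arrays, this sketch instead has each worker process reload the sparse matrix via scipy's save_npz/load_npz):

```python
import multiprocessing as mp
import scipy.sparse as sp

MATRIX_FILE = "X.npz"  # created beforehand with sp.save_npz(MATRIX_FILE, X)
_X = None

def init_worker():
    """Load the sparse matrix once per worker process."""
    global _X
    _X = sp.load_npz(MATRIX_FILE)

def compute_block(block_range):
    """Compute rows [start, stop) of X * X.T and save them to their own file."""
    start, stop = block_range
    block = _X[start:stop].dot(_X.T)
    sp.save_npz("rows_%d_%d.npz" % (start, stop), block.tocsr())

if __name__ == "__main__":
    n_rows = sp.load_npz(MATRIX_FILE).shape[0]
    step = 1000
    ranges = [(i, min(i + step, n_rows)) for i in range(0, n_rows, step)]
    with mp.Pool(initializer=init_worker) as pool:
        pool.map(compute_block, ranges)
```

The per-block .npz files can then be merged, or simply read back block by block, afterwards.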

Roland Smith
  • Thought about something along those lines. It will potentially be slower then, but I guess that's how it is. Maybe I can parallelize it. I could use numpy's memmap then as well. – fsociety Aug 03 '14 at 15:49
  • @ph_singer You may be able to speed up the computation itself by parallelizing it, but if you have to write the output to a file, you'll have additional IO overhead that no amount of parallelism will help with. Benchmark the single-threaded version first before you dive into parallelization. – ali_m Aug 03 '14 at 16:20
  • I have added a possible strategy for parallelizing to my answer. Since N-core machines are quite common these days, you could expect an N-fold improvement on the pure calculation times. But memmap-ing the matrix in every process and writing the result to disk will slow things down considerably. – Roland Smith Aug 03 '14 at 16:43
  • Thanks, I will start with a single-threaded version. I can also reduce the calculation by half as the resulting matrix is symmetric. – fsociety Aug 03 '14 at 16:54
  • I just figured out that slicing is super slow (around 800 ms per slice). Hence, this approach does not really seem to work either :/ Any ideas on how to cope with that? I already do efficient row slicing with a csr_matrix. – fsociety Aug 03 '14 at 19:02
  • @ph_singer Indexing speed will vary depending on the sparse format of the array. Indexing rows of a CSR ("compressed sparse row") matrix will be much faster than indexing columns, and vice versa for CSC. It might be faster to keep `X` as CSR and create a separate copy of `X.T` as a CSC matrix, provided that both will fit in memory. – ali_m Aug 04 '14 at 00:58
  • You can use bigger slices to speed this up a lot. E.g. if you have enough space for `1000*n_docs` floats, then compute `X[:1000] * X.T`, then `X[1000:2000] * X.T`, etc. – Fred Foo Aug 04 '14 at 12:49