7

The problem:

Suppose I have a group of around 1,000,000 short documents D (no more than 50 words each), and I want to let users supply a document from the same group D and get back the top K most similar documents from D.

My approach:

My first approach was to preprocess the group D by applying simple tf-idf, and once I have a vector for each document (which is extremely sparse), to use a simple nearest neighbours algorithm based on cosine similarity. Then, at query time, to just use my static nearest neighbours table, whose size is 1,000,000 x K, without any further calculations.

After applying tf-idf, I got vectors of dimension ~200,000, which means I now have a very sparse table (which can be stored efficiently in memory using sparse vectors) of size 1,000,000 x 200,000. However, calculating the nearest neighbours model has taken more than one day and still hasn't finished. I tried to lower the vector dimension by applying HashingTF instead, which uses the hashing trick, so I can set the dimension to a constant (in my case, I used 2^13), but I still get the same bad performance.
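
Roughly, the pipeline looks like this (a simplified sketch; the SparkSession setup, the `docs_df` DataFrame with a `text` column, and the sparse-vector conversion are stand-ins for my real code):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, HashingTF, IDF
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

K = 10  # number of neighbours to precompute per document

spark = SparkSession.builder.appName("tfidf-knn").getOrCreate()
docs_df = spark.read.json("docs.json")  # one record per document, with a 'text' field

# Spark side: tokenize and hash every document into a fixed-size sparse tf-idf vector.
words = Tokenizer(inputCol="text", outputCol="words").transform(docs_df)
tf = HashingTF(inputCol="words", outputCol="tf", numFeatures=2 ** 13).transform(words)
tfidf = IDF(inputCol="tf", outputCol="tfidf").fit(tf).transform(tf)

# Collect the sparse vectors to the driver and build a SciPy CSR matrix.
data, indices, indptr = [], [], [0]
for row in tfidf.select("tfidf").collect():
    v = row["tfidf"]                       # pyspark.ml.linalg.SparseVector
    indices.extend(v.indices)
    data.extend(v.values)
    indptr.append(len(indices))
X = csr_matrix((data, indices, indptr), shape=(len(indptr) - 1, 2 ** 13))

# sklearn side: with metric='cosine' only brute force is available, so building the
# full 1,000,000 x K table means ~10^12 pairwise similarities -- this is the slow part.
nn = NearestNeighbors(n_neighbors=K, metric="cosine", algorithm="brute").fit(X)
distances, neighbour_table = nn.kneighbors(X)  # static table of shape 1,000,000 x K
```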

Some technical information:

I use Spark 2.0 for the tf-idf calculation, and sklearn NearestNeighbours on the collected data.

Is there any more efficient way to achieve this goal?

Thanks in advance.

Edit:

I had an idea to try an LSH-based approximate similarity algorithm like those implemented in Spark, as described here, but I could not find one that supports the 'cosine' similarity metric.
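
One workaround I am considering (an untested sketch, and as far as I can tell `BucketedRandomProjectionLSH` needs Spark 2.1+): for L2-normalised vectors, ||x − y||² = 2(1 − cos(x, y)), so Euclidean LSH on the normalised tf-idf vectors should rank neighbours the same way cosine would:

```python
from pyspark.ml.feature import Normalizer, BucketedRandomProjectionLSH

# L2-normalise the tf-idf vectors so Euclidean distance becomes monotone in cosine distance.
normed = Normalizer(inputCol="tfidf", outputCol="norm", p=2.0).transform(tfidf)

lsh = BucketedRandomProjectionLSH(inputCol="norm", outputCol="hashes",
                                  bucketLength=2.0, numHashTables=5)  # tuning values are guesses
model = lsh.fit(normed)

# Approximate top-K neighbours for one query document; the cosine similarity can be
# recovered from the returned Euclidean distance as 1 - dist**2 / 2.
key = normed.first()["norm"]
neighbours = model.approxNearestNeighbors(normed, key, numNearestNeighbors=K)
neighbours.show()
```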

barak david
  • 146
  • 8
  • 1
    You might consider labeling your documents, labeling the query document as well, and checking the cosine distance only against documents of the same subset. Unless you reduce the search space in some way, you will still have to compute the cosine distance to each of the 1,000,000 stored documents; there is no way around it. Different algorithms for computing this distance may give slight improvements in overall speed, but you will still have to pass over all vectors in your collection. – Daneel R. Oct 05 '18 at 06:24
  • You don't say what your application domain is, but you might want to use something like [Elasticsearch](https://www.elastic.co/) to retrieve similar documents, which under the hood uses methods similar to the one you are using, but is heavily optimized to efficiently deal with large amounts of documents. – IonicSolutions Oct 05 '18 at 07:27
  • If you use sklearn NearestNeighbours, you are still using the euclidean metric. If that is fine by you, use [BucketedRandomProjectionLSH](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.BucketedRandomProjectionLSH) – mayank agrawal Oct 05 '18 at 07:53
  • 1
    @mayankagrawal NearestNeighbours in sklearn has an option to choose metric ('cosine' for instance). – barak david Oct 05 '18 at 08:13
  • @DanielR. look at my edit for approximation algorithms, thanks. – barak david Oct 05 '18 at 08:14
  • You can make use of third-party packages such as [this](https://github.com/linkedin/scanns). They have a cosine metric for nearest neighbors. – mayank agrawal Oct 05 '18 at 08:19
  • Alternatively, you can go for `mllib.feature.IndexedRowMatrix`'s `columnSimilarities` and develop your own logic for nearest neighbors. Check [this](https://stackoverflow.com/questions/46663775/spark-cosine-distance-between-rows-using-dataframe) – mayank agrawal Oct 05 '18 at 08:22
  • @mayank agrawal thanks, I tried using the IndexedRowMatrix too, but I got memory errors. I will check out your other link. – barak david Oct 05 '18 at 10:46
  • I found [this](https://github.com/kayzhu/LSHash/blob/master/lshash/lshash.py), which accepts a constructor parameter `dist_func=cosine_dist`, where `cosine_dist` is defined as `return 1 - np.dot(x, y) / ((np.dot(x, x) * np.dot(y, y)) ** 0.5)`. If you don't like it, you may try changing `def cosine_dist(x, y):` to `return cosine_distance(x,y)` from sklearn. – Daneel R. Oct 05 '18 at 16:00

1 Answer

0

There are some requirements the algorithm places on the relation between the number of training instances and the dimensionality of your vectors, but you can try DIMSUM.

You can find the paper here.
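
As far as I know, Spark already ships a DIMSUM implementation: `RowMatrix.columnSimilarities` switches to DIMSUM sampling when you pass a positive threshold. Note that it computes similarities between columns, so your documents would have to be the columns of the matrix (a rough, untested sketch):

```python
from pyspark.mllib.linalg.distributed import RowMatrix

# rows_rdd: an RDD of mllib vectors laid out so that each *column* is a document
# (that documents-as-columns layout is what columnSimilarities expects; assumed to exist).
mat = RowMatrix(rows_rdd)

# A positive threshold turns on DIMSUM sampling: only similarities expected to exceed
# the threshold are estimated, which is what makes it cheap.
sims = mat.columnSimilarities(threshold=0.1)   # CoordinateMatrix of cosine similarities
print(sims.entries.take(5))                    # MatrixEntry(i, j, value) triples
```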

Elmar Macek
  • 380
  • 4
  • 12
  • Thanks for your answer - I looked at DIMSUM too, but it doesn't fit my needs: DIMSUM assumes that the feature (vector) dimension is much bigger than the number of vectors (documents, in my case), and that is not my situation. – barak david Oct 09 '18 at 18:27
  • Sorry to hear that. Generally, I assume the main other thing you could exploit is the fact that your high-dimensional vectors are sparse, which I guess means that a lot of the similarities are 0, i.e. the documents have no feature in common. You could try iterating over the different features and only calculating the similarities of documents that have that feature in common (roughly the sketch after these comments). Documents that do not show up in the result have a similarity of 0 by definition, since they have no feature dimension > 0 in common. For this approach to bring you a calculation speed-up, the features must be very sparse, I guess. – Elmar Macek Oct 10 '18 at 06:52
  • Also, I guess you are only interested in non-trivial, meaningful words. So you could try removing high-df words in order to increase both the semantic signal and the sparsity. – Elmar Macek Oct 10 '18 at 07:19
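
A rough sketch of the candidate-generation idea from the comment above (plain SciPy/NumPy; `X` is assumed to be the 1,000,000 x d sparse tf-idf matrix in CSR format):

```python
from collections import defaultdict
import numpy as np
from sklearn.preprocessing import normalize

X = normalize(X)  # L2-normalise rows so a dot product equals the cosine similarity

# Inverted index: feature id -> ids of the documents in which it is non-zero.
postings = defaultdict(list)
coo = X.tocoo()
for doc_id, feature_id in zip(coo.row, coo.col):
    postings[feature_id].append(doc_id)

def top_k(query_id, k=10):
    """Top-k cosine neighbours, scoring only documents that share a feature with the query."""
    q = X[query_id]
    candidates = set()
    for feature_id in q.indices:
        candidates.update(postings[feature_id])
    candidates.discard(query_id)          # every other document has similarity exactly 0
    candidates = np.fromiter(candidates, dtype=np.int64, count=len(candidates))

    # Cosine similarity only against the candidates, not all 1,000,000 documents.
    sims = (X[candidates] @ q.T).toarray().ravel()
    order = np.argsort(-sims)[:k]
    return candidates[order], sims[order]
```

This only pays off if high-df terms are pruned first, as the last comment suggests; otherwise a single very common word pulls in almost all 1,000,000 documents as candidates.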