How to compare millions of minhashed documents on elasticsearch?

Question

I have a large number of documents stored in Elasticsearch, each with a minhash field (computed from the content, for similarity detection). I would like to compare all of them with each other through the Elasticsearch API to find documents with similar hashes, but I can't use a fuzzy query because it allows a maximum edit distance of only 2, which makes it useless here.

I am also looking for a possible Node.js implementation in case this cannot be done in Elasticsearch. My first approach was to retrieve the IDs and minhash values (hex strings) of all documents from Elasticsearch, store them in an array, and sort them lexicographically. Then I would only have to compare each document with its k nearest neighbours in the sorted array, based on edit distance, so instead of n*(n-1)/2 comparisons I would need only about n*k. What do you think of this approach?
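The sorted-window idea described above could be sketched in Node.js roughly as follows. This is only an illustration under assumptions: the `{ id, hash }` document shape and the names `editDistance` and `similarInWindow` are hypothetical, and the hashes are treated as plain hex strings.

```javascript
// Levenshtein edit distance between two strings (two-row dynamic programming).
function editDistance(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// docs: [{ id, hash }] where hash is a hex string (assumed shape).
// Sorts by hash, then compares each document only with the next k documents
// in sorted order, so at most n*k comparisons are made.
function similarInWindow(docs, k, maxDistance) {
  const sorted = [...docs].sort((x, y) =>
    x.hash < y.hash ? -1 : x.hash > y.hash ? 1 : 0
  );
  const pairs = [];
  for (let i = 0; i < sorted.length; i++) {
    for (let j = i + 1; j <= i + k && j < sorted.length; j++) {
      const distance = editDistance(sorted[i].hash, sorted[j].hash);
      if (distance <= maxDistance) {
        pairs.push({ a: sorted[i].id, b: sorted[j].id, distance });
      }
    }
  }
  return pairs;
}
```

One caveat with this sketch: lexicographic order only places hashes next to each other when they share a prefix, so two hashes that differ in their leading characters can end up far apart and be missed with a small k.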

This node.js module might help: https://github.com/duhaime/minhash — Val, Mar 25 '19 at 08:42
I was already using exactly this module ;). But now my problem is how to compare those hashed documents efficiently. I stored the hashes as terms in Elasticsearch, but I don't know how to compare them because similar hash values are not grouped into the same "buckets"; I only have the plain minhash values. That's my dilemma. And Elasticsearch's fuzzy search only allows an edit distance of up to 2, which is useless in my case. — MMMM, Mar 25 '19 at 08:44
ok but you can still compute the similarity between each pair by calling the `jaccard()` method for the KNN, right? — Val, Mar 25 '19 at 08:50
Yeah sure, but since the minhash value is stored inside each document, comparing every pair would take about n^2 comparisons, which is too inefficient. I thought there was some kind of "trick" with LSH and a bucketing approach, but I'm not sure how to interpret that from the papers... I guess I didn't understand it that well. — MMMM, Mar 25 '19 at 08:52
Also see this answer: https://stackoverflow.com/a/41254259/4604579 — Val, Mar 25 '19 at 08:53
@Val thanks for the idea. I had already gotten as far as using the plugin, but I am stuck: the plugin has a "minhash" field type, on which you cannot do a fuzzy search or any search in general (that only works on text or term fields). And the "copy to bits" does not work for me either, although I used the operator as in the examples, and the creator isn't answering on GitHub at the moment... — MMMM, Mar 25 '19 at 08:55
more_like_this doesn't work on the minhash type either, but I could try it on my custom minhash field. I'm not sure how MLT works; I think it is also some kind of cosine similarity, but then you would still do about n^2 comparisons, I think. — MMMM, Mar 25 '19 at 09:00
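
The bucketing "trick" with LSH mentioned in the comments is commonly done by banding: the minhash signature is split into bands, each band becomes a bucket key, and only documents that share at least one bucket are compared. Below is a minimal Node.js sketch, assuming each signature is available as an array of integers rather than a single hex string. All names (`bandKeys`, `candidatePairs`, `estimateJaccard`, `bandQuery`) and the `bands` field are hypothetical, not part of any plugin or module named in this thread.

```javascript
// signature: array of integers (assumed minhash signature shape).
// Splits the signature into bands of `rows` values; returns one key per band.
function bandKeys(signature, rows) {
  const keys = [];
  for (let start = 0; start + rows <= signature.length; start += rows) {
    const band = start / rows;
    keys.push(band + ":" + signature.slice(start, start + rows).join("-"));
  }
  return keys;
}

// docs: [{ id, signature }] (assumed shape).
// Returns candidate pairs as "idA|idB" strings: documents that share at
// least one band bucket. Only these pairs need a full comparison.
function candidatePairs(docs, rows) {
  const buckets = new Map();
  for (const doc of docs) {
    for (const key of bandKeys(doc.signature, rows)) {
      if (!buckets.has(key)) buckets.set(key, []);
      buckets.get(key).push(doc.id);
    }
  }
  const pairs = new Set();
  for (const ids of buckets.values()) {
    for (let i = 0; i < ids.length; i++) {
      for (let j = i + 1; j < ids.length; j++) {
        pairs.add([ids[i], ids[j]].sort().join("|"));
      }
    }
  }
  return [...pairs];
}

// Estimated Jaccard similarity: fraction of signature positions that agree.
function estimateJaccard(a, b) {
  let same = 0;
  for (let i = 0; i < a.length; i++) if (a[i] === b[i]) same++;
  return same / a.length;
}

// Elasticsearch query body matching documents that share at least one band.
// `field` is a hypothetical keyword field holding each document's band keys.
function bandQuery(field, signature, rows) {
  return {
    query: {
      bool: {
        should: bandKeys(signature, rows).map((key) => ({
          term: { [field]: key },
        })),
        minimum_should_match: 1,
      },
    },
  };
}
```

If the band keys were indexed as a keyword field, the body returned by `bandQuery` could be sent as an ordinary search request, so the candidate lookup would stay inside Elasticsearch. Fewer rows per band produce more candidates (higher recall, more comparisons); more rows per band produce fewer.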

0 Answers