I have a large number of documents stored in Elasticsearch, each with a MinHash field computed from its content for similarity detection. I would like to compare all of them with each other through the Elasticsearch API to find documents with similar hashes, but a fuzzy query does not help because it allows a maximum edit distance of 2.
I am also open to a Node.js implementation if this cannot be done in Elasticsearch. My first approach was to retrieve the IDs and MinHash values (hex strings) of all documents from Elasticsearch, store them in an array, and sort them lexicographically. Then I would only have to compare each document with its k nearest neighbours in the sorted order, using edit distance, which takes about n*k comparisons instead of n*(n-1)/2. What do you think of this approach?
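To make the idea concrete, here is a minimal Node.js sketch of what I have in mind. It assumes the IDs and hashes have already been fetched from Elasticsearch into an array of `{ id, hash }` objects; that shape, the function names `editDistance` and `findNearDuplicates`, and the `maxDistance` threshold are all made up for illustration.

```javascript
// Levenshtein edit distance between two strings, using a rolling row.
function editDistance(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,       // deletion
        curr[j - 1] + 1,   // insertion
        prev[j - 1] + cost // substitution or match
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// docs: hypothetical array of { id, hash } objects (hash = hex string).
// k: how many following neighbours in sorted order to compare against.
// maxDistance: assumed threshold below which two hashes count as similar.
function findNearDuplicates(docs, k, maxDistance) {
  // Sort a copy lexicographically by hash so the input is not mutated.
  const sorted = [...docs].sort((x, y) =>
    x.hash < y.hash ? -1 : x.hash > y.hash ? 1 : 0
  );
  const pairs = [];
  for (let i = 0; i < sorted.length; i++) {
    const end = Math.min(i + k, sorted.length - 1);
    // Each document is compared with at most k successors: about n*k pairs.
    for (let j = i + 1; j <= end; j++) {
      const distance = editDistance(sorted[i].hash, sorted[j].hash);
      if (distance <= maxDistance) {
        pairs.push({ a: sorted[i].id, b: sorted[j].id, distance });
      }
    }
  }
  return pairs;
}
```

One thing I am unsure about: lexicographic order only puts hashes next to each other when they share a prefix, so two hashes that differ in their first characters but agree everywhere else would end up far apart and never be compared.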