I have a reasonable understanding of a technique to detect similar documents: first compute their minhash signatures (from their shingles, or n-grams), and then use an LSH-based algorithm to cluster them efficiently (i.e., avoid the quadratic complexity that a naive exhaustive pairwise search would entail).
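To make that setup concrete, here is a minimal Python sketch of the pipeline as I understand it; the function names, the word-level 3-gram shingling, and the 20 bands x 5 rows split are my own illustrative choices, not taken from any of the three sources below:

```python
import hashlib
from collections import defaultdict

def shingles(text, n=3):
    """Split a document into overlapping word n-grams (shingles)."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash_signature(shingle_set, num_hashes=100):
    """One signature entry per hash function: the minimum hash value over all shingles."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set
        ))
    return signature

def lsh_candidate_pairs(signatures, bands=20, rows=5):
    """Band each signature; documents that share any banded slice become candidate pairs."""
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[(b, band)].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add((ids[i], ids[j]))
    return pairs

# Example: two near-duplicate documents should land in the same bucket for some band.
docs = {"d1": "the quick brown fox jumps over the lazy dog",
        "d2": "the quick brown fox jumped over a lazy dog"}
sigs = {doc_id: minhash_signature(shingles(t)) for doc_id, t in docs.items()}
print(lsh_candidate_pairs(sigs))
```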
What I'm trying to do is to bridge three different algorithms, which I think are all related to this minhash + LSH framework, but in non-obvious ways:
(1) The algorithm sketched in Section 3.4.3 of Chapter 3 of the book Mining of Massive Datasets (Rajaraman and Ullman), which seems to be the canonical description of minhashing
(2) Ryan Moulton's Simple Simhashing article
(3) Charikar's so-called SimHash algorithm, described in this article
I find this confusing, because my assumption is that although (2) uses the term "simhashing", it's actually doing minhashing in a way similar to (1), but with the crucial difference that a cluster can only be represented by a single signature (even though multiple hash functions might be involved), whereas with (1) two documents have more chances of being detected as similar, because their signatures can collide in multiple "bands". (3) seems like a different beast altogether: the signatures are compared in terms of their Hamming distance, and the LSH technique involves sorting the signatures multiple times instead of banding them. I also saw (somewhere else) that this last technique can incorporate a notion of weighting, which can be used to put more emphasis on certain document parts, and which seems to be lacking in (1) and (2).
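For contrast, here is a minimal sketch of the fingerprinting step of (3) as I understand it, with the per-token weighting folded in; the 64-bit size and the example weights are illustrative assumptions, and the multiple-sort LSH lookup over the fingerprints is not shown:

```python
import hashlib

def simhash(weighted_tokens, bits=64):
    """Charikar-style fingerprint: each token votes on every bit, scaled by its weight."""
    counts = [0.0] * bits
    for token, weight in weighted_tokens.items():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        for i in range(bits):
            # Bit set -> vote +weight, bit clear -> vote -weight.
            counts[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i in range(bits):
        if counts[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming_distance(a, b):
    """Number of differing bits; a small distance suggests similar documents."""
    return bin(a ^ b).count("1")

# Example: "fox"/"dog" carry more weight, so they dominate the fingerprints.
a = simhash({"quick": 1.0, "brown": 1.0, "fox": 3.0})
b = simhash({"quick": 1.0, "brown": 1.0, "dog": 3.0})
print(hamming_distance(a, b))
```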
So, finally, my two questions:
(a) Is there a (satisfying) way in which to bridge those three algorithms?
(b) Is there a way to import this notion of weighting from (3) into (1)?