Suppose I have five sets I'd like to cluster. I understand that the SimHashing technique described here:
https://moultano.wordpress.com/2010/01/21/simple-simhashing-3kbzhsxyg4467-6/
could yield three clusters ({A}
, {B,C,D}
and {E}
), for instance, if its results were:
A -> h01
B -> h02
C -> h02
D -> h02
E -> h03
Similarly, the MinHashing technique described in the Chapter 3 of the MMDS book:
http://infolab.stanford.edu/~ullman/mmds/ch3.pdf
could also yield the same three clusters if its results were:
A -> h01 - h02 - h03
B -> h04 - h05 - h06
|
C -> h04 - h07 - h08
|
D -> h09 - h10 - h08
E -> h11 - h12 - h13
(Each set corresponds to a MH signature composed of three "bands", and two sets are grouped if at least one of their signature bands is matching. More bands would mean more matching chances.)
However I have several questions related to these:
(1) Can SH be understood as a single band version of MH?
(2) Does MH necessarily imply the use of a data structure like Union-Find to build the clusters?
(3) Am I right in thinking that the clusters, in both techniques, are actually "pre-clusters", in the sense that they are just sets of "candidate pairs"?
(4) If (3) is true, does it imply that I still have to do an O(n^2)
search inside each "pre-cluster", to partition them further into "real" clusters? (which might be reasonable if I have a lot of small and fairly balanced pre-clusters, not so much otherwise)