As far as I understand, one of the main functions of the LSH method is data reduction, even beyond the underlying hashes (often minhashes). I have been using the textreuse
package in R, and I am surprised by the size of the data it generates. textreuse
is a peer-reviewed ROpenSci package, so I assume it does its job correctly, but my question persists.
Let's say I use 256 permutations and 64 bands for my minhash and LSH functions respectively -- realistic values that are often used to detect similarities as low as 50% with relative certainty (~98%).
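These numbers can be sanity-checked with the package's own helpers (assuming the standard S-curve approximation they implement):

library(textreuse)

# probability that a pair with true Jaccard similarity 0.5 shares at least
# one bucket when 256 minhashes are split into 64 bands of 4 rows each
lsh_probability(h = 256, b = 64, s = 0.5)   # ~0.98

# approximate similarity threshold implied by these settings
lsh_threshold(h = 256, b = 64)              # ~0.35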
If I hash a random text file using TextReuseTextDocument (256 perms) and assign it to trtd, I will have:
object.size(trtd$minhashes)
> 1072 bytes
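For completeness, the setup looks roughly like this (a minimal sketch; the file path and seed are placeholders, not what I actually used):

library(textreuse)

minhash <- minhash_generator(n = 256, seed = 1)        # 256 permutations
trtd <- TextReuseTextDocument(file = "random.txt",     # placeholder path
                              minhash_func = minhash)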
Now let's create the LSH buckets for this object (64 bands) and assign the result to l. I will have:
object.size(l$buckets)
> 6704 bytes
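This continues the sketch above, mirroring the call I described (assuming lsh() is applied directly to the single document):

# 64 bands of 4 minhashes each; each band is digested into one bucket hash
l <- lsh(trtd, bands = 64)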
So, the retained hashes in the LSH buckets are six times larger than the original minhashes. I understand this happens because textreuse
uses an MD5 digest to create the bucket hashes.
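A back-of-the-envelope accounting seems consistent with that (assuming 64-bit R and its usual per-object overhead; ~96 bytes is my estimate for a 32-character string object):

# minhashes: 256 four-byte integers plus a ~48-byte vector header
256 * 4 + 48
> 1072

# buckets: 64 MD5 digests, each stored as a 32-character string of roughly
# 96 bytes, plus an 8-byte pointer per element and the vector header
64 * (96 + 8) + 48
> 6704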
But isn't this too wasteful, even overkill, and can't I improve it? Is it normal for our data reduction technique to balloon to this extent? And wouldn't it be more effective to match the documents based on the original hashes (similar to perms = 256 and bands = 256) and then use a threshold to weed out the false positives?
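To make that last alternative concrete, something like the following is what I have in mind (a hypothetical sketch; sig_sim() and trtd2 are illustrative names, not part of textreuse):

# estimate Jaccard similarity directly from the raw minhash signatures
sig_sim <- function(a, b) mean(a$minhashes == b$minhashes)

# keep a pair as a candidate only if the estimate clears a chosen threshold
sig_sim(trtd, trtd2) >= 0.5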
Note that I have reviewed the standard texts, such as Mining of Massive Datasets, but my question remains about this particular implementation. Also note that the question is not only out of curiosity but out of need: when you have millions or billions of hashes, these differences become significant.