
As far as I understand, one of the main functions of the LSH method is data reduction, even beyond the underlying hashes (often minhashes). I have been using the textreuse package in R, and I am surprised by the size of the data it generates. textreuse is a peer-reviewed rOpenSci package, so I assume it does its job correctly, but my question persists.

Let's say I use 256 permutations and 64 bands for my minhash and LSH functions, respectively -- realistic values that are often used to detect, with relative certainty (~98%), similarities as low as 50%.

If I hash a random text file using TextReuseTextDocument (256 perms) and assign it to trtd, I will have:

object.size(trtd$minhashes)
> 1072 bytes

Now let's create the LSH buckets for this object (64 bands) and assign the result to l. I will have:

object.size(l$buckets)
> 6704 bytes
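
For reference, the two objects above come from roughly the following workflow (the file path is a placeholder, and the argument names are how I understand the textreuse documentation):

library(textreuse)

# 256 random permutations for the minhash signatures
minhash <- minhash_generator(n = 256, seed = 1234)

# "some-document.txt" stands in for my input file
trtd <- TextReuseTextDocument(file = "some-document.txt",
                              minhash_func = minhash,
                              keep_tokens = TRUE)

# 64 bands of 4 minhashes each
l <- lsh(trtd, bands = 64)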

So the hashes retained in the LSH buckets take about six times as much space as the original minhashes. I understand this happens because textreuse uses an MD5 digest to create the bucket hashes.
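
To illustrate where the size difference comes from (my own back-of-the-envelope reconstruction, not textreuse's actual internals): 256 integer minhashes fit in a plain integer vector, while 64 MD5 digests are 32-character strings, each of which carries R's per-string overhead.

library(digest)  # textreuse digests each band; plain digest() is used here as a stand-in

minhashes <- sample.int(.Machine$integer.max, 256)  # stand-in for trtd$minhashes
bands     <- split(minhashes, rep(1:64, each = 4))  # 64 bands of 4 minhashes
buckets   <- vapply(bands, digest, character(1), algo = "md5")

object.size(minhashes)  # about 1 KB: a plain integer vector
object.size(buckets)    # several KB: 64 character strings of 32 hex characters each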

But isn't this wasteful, even overkill, and can't I improve on it? Is it normal that our data reduction technique ends up bloating the data to this extent? And isn't it more efficient to match the documents based on the original minhashes (equivalent to perms = 256 and bands = 256) and then use a threshold to weed out the false positives?
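
To make that last idea concrete, here is the kind of matching I have in mind, as a rough sketch in plain R (the two signatures below are synthetic stand-ins for the minhashes of two documents):

# Two hypothetical minhash signatures (in practice, trtd_a$minhashes and trtd_b$minhashes)
set.seed(42)
minhashes_a <- sample.int(.Machine$integer.max, 256)
minhashes_b <- minhashes_a
minhashes_b[1:128] <- sample.int(.Machine$integer.max, 128)  # roughly 50% overlap

# Band size 1: the pair is a candidate if the signatures agree in at least one position
candidate <- any(minhashes_a == minhashes_b)

# Then threshold on the estimated Jaccard similarity to weed out false positives
est_sim <- mean(minhashes_a == minhashes_b)
candidate && est_sim >= 0.5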

Note that I have reviewed the standard texts, such as Mining of Massive Datasets, but this question is specifically about this particular implementation. Also note that the question is not just out of curiosity but out of need: when you have millions or billions of hashes, these differences become significant.

retrography

1 Answer


Package author here. Yes, it would be wasteful to use more hashes/bands than you need. (Though keep in mind we are talking about kilobytes here, which could be much smaller than the original documents.)

The question is, what do you need? If you need to find only matches that are close to identical (i.e., with a Jaccard score close to 1.0), then you don't need a particularly sensitive search. If, however, you need to reliably detect potential matches that share only a partial overlap (i.e., with a Jaccard score closer to 0), then you need more hashes/bands.

Since you've read MMD, you can look up the equation there. But there are two functions in the package, documented here, that can help you calculate how many hashes/bands you need. lsh_threshold() will calculate the threshold Jaccard score that will be detected, while lsh_probability() will tell you how likely it is that a pair of documents with a given Jaccard score will be detected. Play around with those two functions until you get the number of hashes/bands that is optimal for your search problem.
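
For example, with the values from your question (h is the number of hashes, b the number of bands, and s a Jaccard score):

library(textreuse)

h <- 256
b <- 64

lsh_threshold(h = h, b = b)             # approximate Jaccard score at which detection kicks in
lsh_probability(h = h, b = b, s = 0.5)  # chance that a pair with Jaccard 0.5 becomes a candidate

# The same quantities written out from the MMD formulas
(1 / b)^(b / h)           # threshold approximation, about 0.35 here
1 - (1 - 0.5^(h / b))^b   # S-curve at s = 0.5, about 0.98 (the ~98% in the question)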

Lincoln Mullen
  • Nice to meet you, Lincoln! I am well acquainted with those functions. What I mean is: why do 64 bands take so much more space than the original 256 hashes? That is one band for every four minhashes, yet it still takes six times the space. I thought we did banding as a means of data reduction. Why not use the original minhashes (rather than the MD5-hashed bands) for matching purposes when high accuracy at low similarity levels is required? – retrography Aug 17 '20 at 10:38
  • Because the bands are strings and the minhashes are integers. We don't do the banding as a means of data reduction; we do it to identify potential matches. As an implementation detail, I've chosen to do that by additionally hashing the minhashes in the bands to make comparison easier. Perhaps there is another way to do the implementation that would be more efficient in terms of space. In practice, I find that the size of the text is vastly bigger than any of the derivative objects created by this package. – Lincoln Mullen Aug 17 '20 at 14:32
  • That additional cryptographic hashing is exactly where my question falls -- an implementation detail that I am not sure how else could be handled. I will look into the other implementations to see what they have done. For instance, I wonder whether simply using the minhashes as the index (a band size of 1) would create undesirable key collisions beyond the expected false positives. And for sure, I agree that the hashes are often smaller than the text. But at ~6 KB of bucket hashes per document, you run out of memory pretty fast with just a few million documents. – retrography Aug 17 '20 at 14:47
  • The package as a whole could stand to be reimplemented. It is basically intended for in-memory corpora, and it was written back when the objects from the tm package were the de facto standard for text mining in R. I would like to rewrite it so that it takes a tidyverse approach from the beginning and so that it can use a database or other backend to allow for arbitrarily sized data. I probably won't be able to tackle that until I need to use the package myself for something intensive. But you are welcome to open a GitHub issue, talk it through, and contribute a pull request. – Lincoln Mullen Aug 17 '20 at 18:12