Questions tagged [minhash]

MinHash is a probabilistic hashing technique for quickly estimating how similar two sets are.

MinHash (or the min-wise independent permutations locality sensitive hashing scheme) is a technique for quickly estimating how similar two sets are.

The idea of the MinHash scheme is to reduce the variance by averaging together several variables constructed in the same way.

81 questions
21
votes
4 answers

Can you suggest a good minhash implementation?

I am trying to look for a minhash open source implementation which I can leverage for my work. The functionality I need is very simple, given a set as input, the implementation should return its minhash. A python or C implementation would be…
Atish Kathpal
  • 693
  • 2
  • 7
  • 20
19
votes
2 answers

Choosing between SimHash and MinHash for a production system

I'm familiar with the LSH (Locality Sensitive Hashing) techniques of SimHash and MinHash. SimHash uses cosine similarity over real-valued data. MinHash calculates resemblance similarity over binary vectors. But I can't decide which one would be…
Brian Spiering
  • 1,002
  • 1
  • 9
  • 18
17
votes
1 answer

Locality-sensitive hashing - Elasticsearch

is there any plugin allowing LSH on Elasticsearch? If yes, could you point me to the location and tell me a little how to use it? Thanks Edit: I found out that ES uses MinHash plugin. How could I compare documents to one another with this? What…
10
votes
2 answers

Best Method to Intersect Huge HyperLogLogs in Redis

The problem is simple: I need to find the optimal strategy to implement accurate HyperLogLog unions based on Redis' representation thereof--this includes handling their sparse/dense representations if the data structure is exported for use…
Julian
  • 1,406
  • 1
  • 13
  • 24
10
votes
2 answers

Using MinHash to find similarities between 2 images

I am using MinHash algorithm to find similar images between images. I have run across this post, How can I recognize slightly modified images? which pointed me to MinHash algorithm. I was using a C# implementation from this blog post, Set Similarity…
dance2die
  • 35,807
  • 39
  • 131
  • 194
7
votes
2 answers

Implementation of locality-sensitive hashing with min-hash

I have read a lot of tutorials, documents, and pieces of code implementing LSH (locality-sensitive hashing) with min-hash. LSH tries to find the Jaccard coefficient of two sets by hashing random subsets and aggregating over those. I have looked at…
Majic Johnson
  • 341
  • 3
  • 15
6
votes
1 answer

Strange performance issue Spark LSH MinHash approxSimilarityJoin

I'm joining 2 datasets using Apache Spark ML LSH's approxSimilarityJoin method, but I'm seeings some strange behaviour. After the (inner) join the dataset is a bit skewed, however every time one or more tasks take an inordinate amount of time to…
Tom Lous
  • 2,819
  • 2
  • 25
  • 46
6
votes
2 answers

How to hash vectors into buckets in Locality Sensitive Hashing (using jaccard distance)?

I am implementing a near-neighbor search application which will find similar documents. So far I have read a good portion of LSH related materials (theory behind LSH is some kind of confusing and I am not able to comphrened it 100% yet). My code is…
tozak
  • 401
  • 4
  • 12
5
votes
2 answers

Minhash implementation how to find hash functions for permutations

I have a problem implementing minhashing. On paper and from reading I understand the concept, but my problem is the permutation "trick". Instead of permuting the matrix of sets and values the suggestion for implementation is: "pick k (e.g. 100)…
user2359192
  • 71
  • 1
  • 3
4
votes
1 answer

What value to use for numHashTable in Spark LSH by Uber?

I'm trying to use .approxSimilarityJoin of Spark MLlib LSH: MinHash for Jaccard Distance e.g. val mh = new MinHashLSH() .setNumHashTables(5) .setInputCol("features") .setOutputCol("hashes") I understand that the higher the…
4
votes
1 answer

k-means using signature matrix generated from minhash

I have used minhash on documents and their shingles to generate a signature matrix from these documents. I have verified that the signature matrices are good as comparing jaccard distances of known similar documents (say, two articles about the same…
4
votes
1 answer

How do I find the k-nearest values in n-dimensional space?

I read about kd-trees but they are inefficient when the dimensionality of the space is high. I have a database of value and I want to find the values that are within a certain hamming distance of the query. For instance, the database is a list of…
4
votes
3 answers

Relationship between (1) hash function, (2) length of signature and (3) jaccard similarity?

I am trying to understand/implement minHash based jaccard similarity in python. The main goal is use it in MapReduce. However I am not clear how the choice of hash function and length of signature affects error rate in computing jaccard similarity.…
3
votes
1 answer

LSH Spark stucks forever at approxSimilarityJoin() function

I am trying to implement LSH spark to find nearest neighbours for each user on very large datasets containing 50000 rows and ~5000 features for each row. Here is the code related to this. MinHashLSH mh = new…
3
votes
2 answers

What more advantageous minhash over simhash?

I am working with simhash but also see minhash is more effective. But I don't understand. Please explain for me: What more advantageous minhash over simhash ?
xfr1end
  • 303
  • 5
  • 8
1
2 3 4 5 6