Questions tagged [minhash]

MinHash is a probabilistic hashing technique for quickly estimating how similar two sets are.

MinHash (or the min-wise independent permutations locality sensitive hashing scheme) is a technique for quickly estimating how similar two sets are.

The idea of the MinHash scheme is to reduce the variance by averaging together several variables constructed in the same way.

81 questions

votes

4 answers

Can you suggest a good minhash implementation?

I am trying to look for a minhash open source implementation which I can leverage for my work. The functionality I need is very simple, given a set as input, the implementation should return its minhash. A python or C implementation would be…

python hash minhash

asked Jan 26 '13 at 03:01

Atish Kathpal

votes

2 answers

Choosing between SimHash and MinHash for a production system

I'm familiar with the LSH (Locality Sensitive Hashing) techniques of SimHash and MinHash. SimHash uses cosine similarity over real-valued data. MinHash calculates resemblance similarity over binary vectors. But I can't decide which one would be…

minhash simhash

asked Dec 30 '14 at 20:59

Brian Spiering

1,002
1
9
18

votes

1 answer

Locality-sensitive hashing - Elasticsearch

is there any plugin allowing LSH on Elasticsearch? If yes, could you point me to the location and tell me a little how to use it? Thanks Edit: I found out that ES uses MinHash plugin. How could I compare documents to one another with this? What…

elasticsearch locality-sensitive-hash minhash

asked Sep 25 '15 at 08:08

Sebastian Winkler

votes

2 answers

Best Method to Intersect Huge HyperLogLogs in Redis

The problem is simple: I need to find the optimal strategy to implement accurate HyperLogLog unions based on Redis' representation thereof--this includes handling their sparse/dense representations if the data structure is exported for use…

hash redis hashtable hyperloglog minhash

asked May 07 '15 at 16:20

Julian

1,406
1
13
24

votes

2 answers

Using MinHash to find similarities between 2 images

I am using MinHash algorithm to find similar images between images. I have run across this post, How can I recognize slightly modified images? which pointed me to MinHash algorithm. I was using a C# implementation from this blog post, Set Similarity…

c# image-processing fuzzy-comparison minhash

asked May 03 '10 at 14:54

dance2die

35,807
39
131
194

votes

2 answers

Implementation of locality-sensitive hashing with min-hash

I have read a lot of tutorials, documents, and pieces of code implementing LSH (locality-sensitive hashing) with min-hash. LSH tries to find the Jaccard coefficient of two sets by hashing random subsets and aggregating over those. I have looked at…

algorithm locality-sensitive-hash minhash

asked Jan 07 '13 at 21:11

Majic Johnson

votes

1 answer

Strange performance issue Spark LSH MinHash approxSimilarityJoin

I'm joining 2 datasets using Apache Spark ML LSH's approxSimilarityJoin method, but I'm seeings some strange behaviour. After the (inner) join the dataset is a bit skewed, however every time one or more tasks take an inordinate amount of time to…

apache-spark duplicates apache-spark-mllib minhash lsh

asked Jul 18 '18 at 13:47

Tom Lous

2,819
2
25
46

votes

2 answers

How to hash vectors into buckets in Locality Sensitive Hashing (using jaccard distance)?

I am implementing a near-neighbor search application which will find similar documents. So far I have read a good portion of LSH related materials (theory behind LSH is some kind of confusing and I am not able to comphrened it 100% yet). My code is…

c hash machine-learning locality-sensitive-hash minhash

asked Apr 08 '14 at 20:04

tozak

votes

2 answers

Minhash implementation how to find hash functions for permutations

I have a problem implementing minhashing. On paper and from reading I understand the concept, but my problem is the permutation "trick". Instead of permuting the matrix of sets and values the suggestion for implementation is: "pick k (e.g. 100)…

algorithm implementation hash-function minhash

asked Sep 24 '13 at 08:50

user2359192

votes

1 answer

What value to use for numHashTable in Spark LSH by Uber?

I'm trying to use .approxSimilarityJoin of Spark MLlib LSH: MinHash for Jaccard Distance e.g. val mh = new MinHashLSH() .setNumHashTables(5) .setInputCol("features") .setOutputCol("hashes") I understand that the higher the…

scala apache-spark apache-spark-mllib locality-sensitive-hash minhash

asked Nov 21 '17 at 18:02

Marsellus Wallace

17,991
25
90
154

votes

1 answer

k-means using signature matrix generated from minhash

I have used minhash on documents and their shingles to generate a signature matrix from these documents. I have verified that the signature matrices are good as comparing jaccard distances of known similar documents (say, two articles about the same…

python text cluster-analysis k-means minhash

asked Oct 21 '17 at 19:17

Nick Greenquist

votes

1 answer

How do I find the k-nearest values in n-dimensional space?

I read about kd-trees but they are inefficient when the dimensionality of the space is high. I have a database of value and I want to find the values that are within a certain hamming distance of the query. For instance, the database is a list of…

computational-geometry multivariate-partition minhash hamming-distance

asked Mar 06 '10 at 13:49

Eyal

5,728
7
43
70

votes

3 answers

Relationship between (1) hash function, (2) length of signature and (3) jaccard similarity?

I am trying to understand/implement minHash based jaccard similarity in python. The main goal is use it in MapReduce. However I am not clear how the choice of hash function and length of signature affects error rate in computing jaccard similarity.…

python hash-function minhash

asked Dec 29 '13 at 01:29

user3142979

votes

1 answer

LSH Spark stucks forever at approxSimilarityJoin() function

I am trying to implement LSH spark to find nearest neighbours for each user on very large datasets containing 50000 rows and ~5000 features for each row. Here is the code related to this. MinHashLSH mh = new…

java apache-spark locality-sensitive-hash minhash

asked Feb 22 '18 at 12:17

Goutham Panneeru

votes

2 answers

What more advantageous minhash over simhash?

I am working with simhash but also see minhash is more effective. But I don't understand. Please explain for me: What more advantageous minhash over simhash ?

similarity cosine-similarity minhash simhash

asked Apr 15 '16 at 12:35

xfr1end

2 3 4 5 6 Next