Questions tagged [locality-sensitive-hash]

Locality-sensitive hashing (LSH) is a method of probabilistic dimension reduction.

Locality-sensitive hashing (LSH) is a method of performing probabilistic dimension reduction of high-dimensional data. The basic idea is to hash the input items so that similar items are mapped to the same buckets with high probability (the number of buckets being much smaller than the universe of possible input items).

97 questions
159
votes
6 answers

How to understand Locality Sensitive Hashing?

I noticed that LSH seems a good way to find similar items with high-dimension properties. After reading the paper http://www.slaney.org/malcolm/yahoo/Slaney2008-LSHTutorial.pdf, I'm still confused with those formulas. Does anyone know a blog or…
WoooHaaaa
  • 19,732
  • 32
  • 90
  • 138
27
votes
1 answer

Approximate String Matching using LSH

I would like to approximately match Strings using Locality sensitive hashing. I have many Strings>10M that may contain typos. For every String I would like to make a comparison with all the other strings and select those with an edit distance…
nikosdi
  • 2,138
  • 5
  • 26
  • 35
22
votes
5 answers

LSH Libraries in Java

I'm looking for a lightweight Java library that supports Nearest Neighbor Searches by Locality Sensitive Hashing for nearly equally distributed data in a high dimensional (in my case 32) dataset with some hundreds of thousands data points. It's…
s1lence
  • 2,188
  • 2
  • 16
  • 34
20
votes
4 answers

Locality Sensitive Hash Implementation?

Are there any relatively simple to understand (and simple to implement) locality-sensitive hash examples in C/C++/Java/C#? I'd like to learn more about the concept and so want to try an implementation on a few text files just to see how it works, so…
user541686
  • 205,094
  • 128
  • 528
  • 886
17
votes
1 answer

Locality-sensitive hashing - Elasticsearch

is there any plugin allowing LSH on Elasticsearch? If yes, could you point me to the location and tell me a little how to use it? Thanks Edit: I found out that ES uses MinHash plugin. How could I compare documents to one another with this? What…
12
votes
1 answer

Locality Sensitive Hashing of sparse numpy arrays

I have a large sparse numpy/scipy matrix where each row corresponds to a point in high-dimensional space. I want make queries of the following kind: Given a point P (a row in the matrix) and a distance epsilon, find all points with distance at most…
utdiscant
  • 11,128
  • 8
  • 31
  • 40
8
votes
1 answer

Two algorithms to find nearest neighbor with Locality-sensitive hashing, which one?

Currently I'm studying how to find a nearest neighbor using Locality-sensitive hashing. However while I'm reading papers and searching the web I found two algorithms for doing this: 1- Use L number of hash tables with L number of random LSH…
Jack Twain
  • 6,273
  • 15
  • 67
  • 107
7
votes
1 answer

Locality Sensitive Hash or pHash?

I'm trying to implement a general fingerprint memoizator: we have a file that can be expressed through an intelligent fingerprint (like pHash for images or chromaprint for audio) and if our desidered (expensive) function has already been computed on…
justHelloWorld
  • 6,478
  • 8
  • 58
  • 138
7
votes
2 answers

Implementation of locality-sensitive hashing with min-hash

I have read a lot of tutorials, documents, and pieces of code implementing LSH (locality-sensitive hashing) with min-hash. LSH tries to find the Jaccard coefficient of two sets by hashing random subsets and aggregating over those. I have looked at…
Majic Johnson
  • 341
  • 3
  • 15
6
votes
1 answer

Matching millions of people: k-d tree or locality-sensitive hashing?

I am looking for a performant algorithm to match a large number of people by location, gender and age according to this data structure: Longitude (denotes the persons location) Latitude (denotes the persons location) Gender (denotes the persons…
6
votes
2 answers

How to hash vectors into buckets in Locality Sensitive Hashing (using jaccard distance)?

I am implementing a near-neighbor search application which will find similar documents. So far I have read a good portion of LSH related materials (theory behind LSH is some kind of confusing and I am not able to comphrened it 100% yet). My code is…
tozak
  • 401
  • 4
  • 12
5
votes
3 answers

Make a Sim Hash (Locality Sensitive Hashing) Algorithm more accurate?

I have 'records' (basically CSV strings) of two names and and one address. I need to find records that are similar to each other: basically the names and address portions all look 'alike' as if they were interpreted by a human. I used the ideas from…
banncee
  • 959
  • 14
  • 30
5
votes
1 answer

How to bucket locality-sensitive hashes?

I already have the algorithm to produce locality-sensitive hashes, but how should I bucket them to take advantage of their characteristics(i.e. similar elements have near hashes(with the hamming distance))? In the matlab code I found they simply…
5
votes
3 answers

Pandas fuzzy detect duplicates

How can use fuzzy matching in pandas to detect duplicate rows (efficiently) How to find duplicates of one column vs. all the other ones without a gigantic for loop of converting row_i toString() and then comparing it to all the other ones?
Georg Heiler
  • 16,916
  • 36
  • 162
  • 292
5
votes
3 answers

Alter the hash function of a dictionary

Following this question, we know that two different dictionaries, dict_1 and dict_2 for example, use the exact same hash function. Is there any way to alter the hash function used by the dictionary?Negative answers also accepted!
gsamaras
  • 71,951
  • 46
  • 188
  • 305
1
2 3 4 5 6 7