20

Are there any relatively simple to understand (and simple to implement) locality-sensitive hash examples in C/C++/Java/C#?

I'd like to learn more about the concept and so want to try an implementation on a few text files just to see how it works, so I don't need anything high-performance or anything... just an example of a hash function that returns similar hashes for similar inputs. I can learn more from it by example afterwards. :)

user541686
  • 205,094
  • 128
  • 528
  • 886

4 Answers4

9

For strings you can use approximate matching algorithm.

If the strings are equidistant from a reference string then chances are that they are similar to each other. And there you go you have a locality senitive hash implementation for strings.

You can create different hash buckets for a range of distances.

EDIT: You can try other variations of string distance. A simpler algorithm would just return no. of common characters between two strings.

Muhammad Hasan Khan
  • 34,648
  • 16
  • 88
  • 131
  • +1 it seems to be *exactly* what I'm looking for, I'll look at it a bit more. Thanks a lot! :) – user541686 Apr 24 '11 at 19:07
  • 2
    @Hasan: I'm a bit confused... the strings `"abcd"` and `"xyzw"` are both a distance 4 from a random string like `"6pGO"`, but they're completely different. How does that work? (It's 4 for pretty much *any* random string of length 4...) – user541686 Apr 24 '11 at 19:23
  • 1
    @Mehrdad this algorithm tells you how many changes you need to transform your input string into the reference (random) string. You can also try a simpler algorithm in which distance is no. of the characters common between the random string and your input string. – Muhammad Hasan Khan Apr 24 '11 at 19:59
  • 2
    @Hasan: Yes I understood what the algorithm does, but the problem is that it doesn't work well at all, like I showed. Do you happen to know of any better algorithms? (The one you just mentioned seems to be great, I'll think about it.) – user541686 Apr 24 '11 at 20:01
  • Such a hashing solution would work well similar strings but for irrelevant strings it will give you not useful results but if you look at it in that case you don't have similar strings anyway so any algorithm won't work. In the first case all irrelevant ones would fall under hashcode 4 with in that you can then maintain another hashlist for further breaking it down. – Muhammad Hasan Khan Apr 25 '11 at 04:30
  • You can get better results by generating n random strings each representing a single bucket (list). You find distance of input string with each bucket head (random string) and insert the string in the bucket with minimum distance. This is called clustering. Later on you can inspect the buckets individually and swap the head with item in the list which represents majority of strings in that list. For more algorithms try http://en.wikipedia.org/wiki/String_metric – Muhammad Hasan Khan Apr 25 '11 at 04:53
  • Hi @MuhammadHasanKhan, thanks for your insightful answers. Can i use the same approach in this problem https://stackoverflow.com/questions/54511595/optimize-matching-elements-from-two-large-data-sets-using-levenshtein-distance – Shahid Manzoor Bhat Nov 23 '19 at 15:56
6

Well there is an excellent into article at MSDN blogs here: http://blogs.msdn.com/b/spt/archive/2008/06/11/locality-sensitive-hashing-lsh-and-min-hash.aspx

Also there is at least once C++ library which you can inspect the source code of here: http://sourceforge.net/projects/lshkit/

Teoman Soygul
  • 25,584
  • 6
  • 69
  • 80
2

There is also a Java Implementation on Hadoop. it does a good job on documents.

it's called LikeLike

Currently Likelike supports only Min-Wise independent permutations. Min-Wise independent permutations is applied to the recommendation of Google News

Mehran
  • 1,977
  • 1
  • 18
  • 33
  • Thanks for the link, but it seems a bit too complicated for my case... I'm trying to avoid large libraries until I understand a simple one. :-) Thanks anyway, +1. – user541686 Apr 24 '11 at 19:06
2

I realise you explicitly asked for C/C++/C#, but there is a Python port of the nilsimsa hash which might be easier to grok than other, larger libraries.

Alec Thomas
  • 19,639
  • 4
  • 30
  • 24
  • +1 that's pretty great, I'm also learning Python so it kills two birds with one stone. Thanks for sharing this! :) – user541686 May 28 '11 at 01:17