13

Has anyone come across a simhash function implemented in Java?

I've already searched for it, but couldn't find anything.

Joachim Sauer
  • 302,674
  • 57
  • 556
  • 614
Joel
  • 29,538
  • 35
  • 110
  • 138
  • 2
    +1. Hadn't heard of this before. Interesting thread. – Mark Bolusmjak Dec 15 '09 at 16:36
  • Yes, they can be quite useful - unlike a normal hash, which attempts to be generate a sparsely populated hash space given a set of similar strings (i.e. generating dissimilar hashes) a simhash attempts to generate similar, local, hash values for similar strings. – Joel Dec 15 '09 at 17:00
  • That link says "hash each feature using a normal 32-bit hash algorithm"; i suspect that in Java, using [String's hashCode](http://docs.oracle.com/javase/6/docs/api/java/lang/String.html#hashCode%28%29) would not be a good move for that, since for two-character strings, it boils down to `s[0]*31 + s[1]`, which is only a 20-bit number, and will have lots of zeroes in the upper of those 20 bits if the two characters are ASCII. A better option might be something like `(((long)s[0]) + 1) * (((long)((~s[0]) & 0xffff)) + 1) -1`. – Tom Anderson Sep 30 '12 at 10:29
  • @TomAnderson, [Guava's `HashFunction`](http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/hash/HashFunction.html#hashString%28java.lang.CharSequence%29) provides better fast non-crypto-strong hashing. – Mike Samuel Jan 17 '13 at 20:55
  • @MikeSamuel: Guava's `HashFunction` is an interface, and as such, doesn't provide anything. However, it's very likely that some implementations of it do provide better hashes - i see there are some good ones in [`Hashing`](http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/hash/Hashing.html). That said, in this particular case, we are only ever hashing two-character strings, so we don't need a good general-purpose hash algorithm; something simple might be good enough. Still, Murmur3-32 should definitely be good enough - as long as it's a revision from 2011-1-4 or later. – Tom Anderson Jan 17 '13 at 23:39
  • @MikeSamuel: The trouble with that is that it isn't a very good hash. If you take a pair of letters and change one of them, no more than half the bits will change, and in all probability, only a few will (eg 'ba' to 'ca' is only a single bit). A good hash would avalanche the change across the whole hash. Murmur3 does this; one could use that, or, because we're only starting with 32 bits, just use its [finalization step](http://code.google.com/p/smhasher/source/browse/trunk/MurmurHash3.cpp?spec=svn136&r=136#66). – Tom Anderson Jan 18 '13 at 10:16
  • @MikeSamuel: In this context, that's not how the hash is being used. Have a read of the simhash document that Joel linked to in the question. – Tom Anderson Jan 18 '13 at 17:18

3 Answers3

3

btw. It looks like Google has patented the algorithm. If you are in US, successfully compete with Google, and do not have own parent portfolio, then do not tell them you are using it.

An implementation in C

http://dsrg.mff.cuni.cz/~holub/sw/shash/


[Removed no longer relevant BibSonomy text]

jitter
  • 53,475
  • 11
  • 111
  • 124
  • 1
    based on the research from when i wrote that post there are loads of variants on sketching/simhash that aren't specifically the charikar implementation. if you look at the references in the actual paper you can hunt them down. – mat kelcey Feb 19 '10 at 00:28
1

Here you can find the full java source code. It's very simple. A demo also is provided. http://aneurone.blogspot.com/2012/09/simhash.html

aNeurone
  • 11
  • 2
  • 2
    Lone link is [considered a poor answer](http://stackoverflow.com/faq#deletion) since it is meaningless by itself and target resource is not guaranteed to be alive in the future. Please try to include at least summary of information you are linking to (like I did). – j0k Sep 05 '12 at 15:56
-2

According to this page, you should ask the developers of BibSonomy.

Aaron Digulla
  • 321,842
  • 108
  • 597
  • 820
  • It it just that there is a mention of a class on that page called SimHash.java, or does that that class actually implements the simhash algorithm as defined in the link in the question? – Joel Dec 15 '09 at 16:27