Questions tagged [simhash]

Algorithm to detect similarities between hashes.

SimHash was developed by Moses Charikar. The algorithm is described in the paper.
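
As a rough illustration of the idea (not code taken from the paper), a minimal Python sketch of a SimHash-style fingerprint might look like the following. The function name `simhash`, the use of MD5 as the per-token hash, token counts as feature weights, and the 64-bit default width are all assumptions made for this example.

```python
import hashlib
from collections import Counter


def simhash(tokens, bits=64):
    """Return a `bits`-wide SimHash-style fingerprint of string tokens.

    Illustrative sketch only: MD5 and count-based weights are assumptions.
    """
    if not 1 <= bits <= 128:
        raise ValueError("bits must be between 1 and 128 (MD5 yields 128 bits)")
    weights = Counter(tokens)  # token frequency serves as the feature weight
    totals = [0] * bits
    for token, weight in weights.items():
        # Stable per-token hash, interpreted as a 128-bit integer.
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest(), "big")
        for i in range(bits):
            # A set bit votes +weight for position i, a clear bit votes -weight.
            totals[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:  # a positive vote total sets bit i of the fingerprint
            fingerprint |= 1 << i
    return fingerprint
```

Two fingerprints produced this way are usually compared by counting the bits in which they differ, for example `bin(a ^ b).count("1")`; fewer differing bits suggests more similar inputs.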

21 questions
19
votes
2 answers

Choosing between SimHash and MinHash for a production system

I'm familiar with the LSH (Locality Sensitive Hashing) techniques of SimHash and MinHash. SimHash uses cosine similarity over real-valued data. MinHash calculates resemblance similarity over binary vectors. But I can't decide which one would be…
Brian Spiering
  • 1,002
  • 1
  • 9
  • 18
13
votes
3 answers

SimHash implementation in Java?

Has anyone come across a simhash function implemented in Java? I've already searched for it, but couldn't find anything.
Joel
  • 29,538
  • 35
  • 110
  • 138
5
votes
3 answers

Make a Sim Hash (Locality Sensitive Hashing) Algorithm more accurate?

I have 'records' (basically CSV strings) of two names and one address. I need to find records that are similar to each other: basically the names and address portions all look 'alike' as if they were interpreted by a human. I used the ideas from…
banncee
  • 959
  • 14
  • 30
4
votes
4 answers

Hash function that maps similar inputs to similar outputs?

Is there a hash function where small changes in the input result in small changes in the output? For example, something like: hash("Foo") => 9e107d9d372bb6826bd81d3542a419d6 hash("Foo!") => 9e107d9d372bb6826bd81d3542a419d7 <- note small difference
Paul Wicks
  • 62,960
  • 55
  • 119
  • 146
3
votes
2 answers

How to compare the similarity of documents with Simhash algorithm?

I'm currently creating a program that can compute a near-duplicate score within a corpus of text documents (5000+ docs). I'm using Simhash to generate a unique footprint of a document (thanks to this github repo). My data is: data = { 1: u'Im…
Dany M
  • 760
  • 1
  • 13
  • 28
3
votes
2 answers

What are the advantages of MinHash over SimHash?

I am working with simhash, but I also see that minhash is said to be more effective, and I don't understand why. Please explain: what are the advantages of minhash over simhash?
xfr1end
  • 303
  • 5
  • 8
3
votes
1 answer

MinHashing vs SimHashing

Suppose I have five sets I'd like to cluster. I understand that the SimHashing technique described here: https://moultano.wordpress.com/2010/01/21/simple-simhashing-3kbzhsxyg4467-6/ could yield three clusters ({A}, {B,C,D} and {E}), for instance, if…
cjauvin
  • 3,433
  • 4
  • 29
  • 38
2
votes
1 answer

Detect near-duplicate documents using simhash

I've found this Python project on GitHub, but when I try to use it for my purpose of detecting near-duplicate documents (e.g. JSON), I'm not getting enough information from the README.md file on how to do that. It only shows how to compute import…
A l w a y s S u n n y
  • 36,497
  • 8
  • 60
  • 103
2
votes
0 answers

Does MongoDB support searching with bitwise XOR and bit count?

I would like to move from MySQL to MongoDB. One question I cannot find an answer to is whether I can get or simulate XOR and Bit Count, which I need. In MySQL I would do: SELECT BIT_COUNT(SimHash ^ $SimHash) as simhash ... ORDER BY simhash It is…
2ge
  • 269
  • 4
  • 12
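
For readers unfamiliar with the expression `BIT_COUNT(SimHash ^ $SimHash)` quoted in the question above: XOR followed by a bit count gives the Hamming distance between two fingerprints, and ordering by it ranks stored fingerprints from nearest to farthest. A minimal Python sketch of the same computation follows; the names `bit_count_xor` and `rank_by_distance` are hypothetical, and the sketch says nothing about what MongoDB itself supports.

```python
def bit_count_xor(a, b):
    """Hamming distance between two integer fingerprints (XOR, then popcount)."""
    return bin(a ^ b).count("1")


def rank_by_distance(query, fingerprints):
    """Return fingerprints sorted from nearest to farthest from `query`."""
    return sorted(fingerprints, key=lambda fp: bit_count_xor(query, fp))


# Example: three stored fingerprints ranked against a 4-bit query.
stored = [0b0000, 0b1110, 0b1100]
ranked = rank_by_distance(0b1111, stored)
```

This linear scan is only meant to show the arithmetic; it does not address how to index fingerprints for fast lookup.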
2
votes
0 answers

SimHash implementation in R

Is there an implementation of simhash in R? (SimHash is a hash algorithm created by Moses Charikar which gives similar objects similar hashes)
dzeltzer
  • 990
  • 8
  • 28
2
votes
1 answer

Clarification needed about min/sim hashing + LSH

I have a reasonable understanding of a technique to detect similar documents that consists of first computing their minhash signatures (from their shingles, or n-grams), and then using an LSH-based algorithm to cluster them efficiently (i.e. avoid the…
1
vote
2 answers

Simhash-like algorithm to compare two text documents

The problem is: I have a collection of text documents, and I want to pick the one most similar to the input document. The input text document could be an exact match or partly modified. The algorithm must be very fast. Currently, I found simhash to take a…
xijing dai
  • 317
  • 1
  • 5
  • 14
1
vote
1 answer

How to assign index numbers to a document dataset using SimhashIndex()?

This code implements the Simhash function for four sets of data. import re from simhash import Simhash, SimhashIndex def get_features(s): width = 3 s = s.lower() s = re.sub(r'[^\w]+', '', s) return [s[i:i + width] for i in range(max(len(s)…
shipika singh
  • 65
  • 1
  • 1
  • 9
1
vote
1 answer

Hamming distance (Simhash python) giving out unexpected value

I was checking out the Simhash module ( https://github.com/leonsim/simhash ). I presume that Simhash("String").distance(Simhash("Another string")) is the Hamming distance between the two strings. Now, I am not sure I understand this…
Ishan Sharma
  • 115
  • 2
  • 9
1
vote
1 answer

Pandas: matrix calculation on values

I have a dataframe like this: apple aple apply apple 0 0 0 aple 0 0 0 apply 0 0 0 I want to calculate string distance, e.g. apple -> aple etc. My end result is here: apple aple apply apple 0…