I have 'records' (basically CSV strings) of two names and one address. I need to find records that are similar to each other: the name and address portions all look 'alike', as if they were interpreted by a human.
I used the ideas from this excellent blog post: http://knol.google.com/k/simple-simhashing# to write a simple SimHash. If two or more strings produce the same SimHash value, I pass that subset of records to a fine-grained matching program that is O(n^2), comparing every record of the subset to every other record.
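In code, the overall pipeline is roughly this (a simplified Python sketch, not my actual implementation; `simhash` and `normalize` are sketched further down):

```python
from collections import defaultdict

def find_candidate_groups(records):
    buckets = defaultdict(list)
    for rec in records:
        # Records whose SimHash collides land in the same bucket.
        buckets[simhash(normalize(rec))].append(rec)
    # Only buckets holding two or more records need the expensive
    # O(n^2) fine-grained pass; singletons have no candidate partner.
    return [group for group in buckets.values() if len(group) > 1]
```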
For the SimHash portion, I have parameters for the datagram size (basically a sliding window of size n over the string) and the number of iterations, which determines how many (random) hashes to use in the SimHash calculation. I have tried various combinations, but a datagram size of 4 with 4 hashes yields the best results so far.
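To make the parameters concrete, here is a minimal Python sketch of the kind of min-hash scheme the post describes: for each of `num_hashes` seeded hash functions, keep the minimum hash over all datagrams, and use the tuple of minima as the bucket key. The MD5-based seeded hash is only illustrative, not my exact code:

```python
import hashlib

def simhash(text, datagram_size=4, num_hashes=4):
    # Slide a window of datagram_size over the string to get the datagram set.
    grams = {text[i:i + datagram_size]
             for i in range(len(text) - datagram_size + 1)}
    if not grams:            # record shorter than the window
        grams = {text}
    key = []
    for seed in range(num_hashes):
        # Seeded hash: prefix each datagram with the seed before hashing,
        # then keep the minimum value over all datagrams.
        key.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{g}".encode()).digest()[:8], "big")
            for g in grams))
    return tuple(key)
```

Note that two records collide only if all `num_hashes` minima agree, so raising the hash count makes buckets stricter: fewer false candidates, but more missed near-duplicates.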
The issue I'm running into is that this method finds only about 80% of the duplicates in my data sets. I know this because I have verified the entire data set against the painfully slow O(n^2) full match mentioned above. That matcher is fine for data sets of fewer than 10^4 records, but quickly becomes infeasible, since I need to run sets of size 10^8.
Any ideas, suggestions or thoughts on how I can increase the accuracy of the SimHash so more of the 'similar' records are tagged with the same SimHash number?
EDIT: Prior to SimHashing, I upper-case each record and remove all characters not in [0-9A-Z] (a sketch of this step follows the examples below). Examples of things that should match (spelling mistakes are intentional):
1. JOHN SMITH, 123 ANY STREET SOMETOWN ZIP
2. JOHNNY SMITH, 123 ANY STRET SOMETOWNE ZIP
3. ROBERT PARKER, 442 ANY STREET SOMETOWN ZIP
Here, records 1 and 2 are similar and 3 is not; the output should be: 1 + 2.
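For reference, the normalization step mentioned in the EDIT amounts to this (Python sketch):

```python
import re

def normalize(record):
    # Upper-case, then strip everything outside [0-9A-Z]
    # (this also removes spaces and punctuation).
    return re.sub(r"[^0-9A-Z]", "", record.upper())
```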