
I have 'records' (basically CSV strings) containing two names and one address. I need to find records that are similar to each other: the names and the address portions all look 'alike', as if they were interpreted by a human.

I used the ideas from this excellent blog post: http://knol.google.com/k/simple-simhashing# to write a simple SimHash. If two or more strings produce the same SimHash, I pass that subset of records to a fine-grained matching program that is O(n^2), comparing every record of the set to every other record.

For the SimHash portion, I have parameters where I can define the datagram size (basically a sliding window of size n over the strings) and the number of (random) hashes to use in the SimHash calculation. So far I have settled on a datagram size of 4 and 4 hashes to compute the SimHash. I have tried various other combinations, but this one yields the best results so far.
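Roughly, the scheme looks something like the following minimal sketch (this is only an approximation of the knol's idea: hash every datagram with each of k seeded hash functions, keep the minimum per function, and bucket records on the concatenated minima; the class and parameter names are made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Illustrative sketch only: records that produce identical keys are
// passed on to the fine-grained O(n^2) matcher.
public class SimHashSketch {
    private final int datagramSize;   // e.g. 4
    private final long[] seeds;       // one seed per random hash, e.g. 4 hashes

    public SimHashSketch(int datagramSize, int numHashes, long masterSeed) {
        this.datagramSize = datagramSize;
        this.seeds = new Random(masterSeed).longs(numHashes).toArray();
    }

    // Sliding window of size n over the (already normalized) string.
    private List<String> datagrams(String s) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + datagramSize <= s.length(); i++) {
            grams.add(s.substring(i, i + datagramSize));
        }
        return grams;
    }

    // Cheap seeded string hash; any well-mixed 64-bit hash would do.
    private long hash(String gram, long seed) {
        long h = seed;
        for (int i = 0; i < gram.length(); i++) {
            h = h * 1099511628211L + gram.charAt(i);
        }
        return h;
    }

    // Bucket key: concatenation of the minimum hash per hash function.
    public String key(String normalizedRecord) {
        List<String> grams = datagrams(normalizedRecord);
        StringBuilder key = new StringBuilder();
        for (long seed : seeds) {
            long min = Long.MAX_VALUE;
            for (String g : grams) {
                min = Math.min(min, hash(g, seed));
            }
            key.append(min).append('|');
        }
        return key.toString();
    }
}
```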

The issue I'm running into is that this method finds about 80% of the duplicates in the data sets I have. I know this because I have verified the entire data set against the painfully slow O(n^2) full match mentioned above. The O(n^2) matcher is OK for data sets of fewer than 10^4 records, but quickly becomes infeasible, since I need to run sets of size 10^8.

Any ideas, suggestions or thoughts on how I can increase the accuracy of the SimHash so more of the 'similar' records are tagged with the same SimHash number?

EDIT: Prior to SimHashing, I capitalize the strings and remove all characters outside [0-9A-Z]. Examples of things that should match (spelling mistakes are intentional):


  • JOHN SMITH, 123 ANY STREET SOMETOWN ZIP
  • JOHNNY SMITH, 123 ANY STRET SOMETOWNE ZIP
  • ROBERT PARKER, 442 ANY STREET SOMETOWN ZIP

Here 1 and 2 are similar, 3 is not. Output should be: 1 + 2

banncee

3 Answers


Before you try to be fancy and change the SimHash, have you tried applying domain-specific knowledge to the problem?

Do you have a list of missed pairs for your algorithm? Is there anything they have in common?

Have you tried things like removing capitalization, converting nicknames to full names, dropping middle names, normalizing N, E, S, W and north, south, east, west to a single form, expanding "st" to "street", etc.?
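For example, a tiny normalization pass along these lines could be run before hashing; the nickname and abbreviation tables here are just illustrative stand-ins, and a real list would have to be built for the target country:

```java
import java.util.Map;

// Illustrative only: apply domain-specific rewrites before hashing.
public class RecordNormalizer {
    // Tiny stand-in tables; real ones would be much larger and locale-specific.
    private static final Map<String, String> NICKNAMES =
            Map.of("JOHNNY", "JOHN", "BOB", "ROBERT", "BILL", "WILLIAM");
    private static final Map<String, String> ABBREVIATIONS =
            Map.of("ST", "STREET", "RD", "ROAD", "N", "NORTH", "S", "SOUTH");

    public static String normalize(String record) {
        String cleaned = record.toUpperCase().replaceAll("[^0-9A-Z ]", " ").trim();
        StringBuilder out = new StringBuilder();
        for (String token : cleaned.split("\\s+")) {
            String t = NICKNAMES.getOrDefault(token, token);
            t = ABBREVIATIONS.getOrDefault(t, t);
            out.append(t).append(' ');
        }
        return out.toString().trim();
    }
}
```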

Rob Neuhaus
  • This is in general excellent advice. I think it would be far more accurate if I ran the data through some sort of standardizer for names and addresses. Unfortunately, this is very difficult in the target country (unlike the US, for example), and can be relied on only very minimally. – banncee Nov 30 '11 at 15:09

SimHash is not a suitable algorithm for this purpose, as it is only useful for near-duplicate detection, where the differences are very minor and the vast majority of features are identical. See my tutorial on SimHash and on solving the Hamming distance problem.

A better approach would be MinHash, possibly with LSH (locality-sensitive hashing). It looks as though the features you hash on would best be generated as shingles of characters (with a length of perhaps 4), rather than shingles of words.

Given such short text fields, and given that word order is not likely to change much, you should consider including terminating shingles as well: shingles from the start and end of a text field that contain fewer than the normal number of characters, plus a terminating mark. This tends to be more lenient toward spelling differences on short runs of text. For example, "Whitmore" and "Whitemore" without terminating shingles would yield

[ WHIT, HITM, ITMO, TMOR, MORE ] and [ WHIT, HITE, ITEM, TEMO, EMOR, MORE ] with low Jaccard similarity of 2/9;

whereas with terminating shingles included these would yield

[ #W, #WH, #WHI, WHIT, HITM, ITMO, TMOR, MORE, ORE#, RE#, E# ] and [ #W, #WH, #WHI, WHIT, HITE, ITEM, TEMO, EMOR, MORE, ORE#, RE#, E# ] with higher Jaccard similarity of 8/15;
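For illustration, a minimal Java sketch of this shingling (ordinary 4-character shingles plus the '#'-marked terminating shingles) and of the Jaccard comparison; running it on WHITMORE vs. WHITEMORE reproduces the 8/15 above:

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Illustrative sketch of character shingles with terminating shingles,
// and Jaccard similarity between two shingle sets.
public class ShingleJaccard {

    public static Set<String> shingles(String s, int n) {
        Set<String> result = new LinkedHashSet<>();
        // Terminating shingles from the start: #W, #WH, #WHI, ...
        for (int k = 1; k < n; k++) {
            result.add("#" + s.substring(0, Math.min(k, s.length())));
        }
        // Ordinary full-length shingles.
        for (int i = 0; i + n <= s.length(); i++) {
            result.add(s.substring(i, i + n));
        }
        // Terminating shingles from the end: ..., ORE#, RE#, E#
        for (int k = n - 1; k >= 1; k--) {
            result.add(s.substring(Math.max(0, s.length() - k)) + "#");
        }
        return result;
    }

    public static double jaccard(Set<String> a, Set<String> b) {
        Set<String> intersection = new LinkedHashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new LinkedHashSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        Set<String> a = shingles("WHITMORE", 4);   // the 11 shingles listed above
        Set<String> b = shingles("WHITEMORE", 4);  // the 12 shingles listed above
        System.out.println(jaccard(a, b));         // 8/15 ≈ 0.533
    }
}
```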

Rob Neuhaus's suggestions on pre-normalizing are very sensible. I would normalize long-form words down to their abbreviations (e.g. "Saint James Street" would be normalized to "ST JAMES ST"). Normalizing in the other direction may be difficult with sometimes ambiguous abbreviations ("St" --> "STREET" or "SAINT"?), and also, the abbreviated forms contribute to fewer shingles and thus have less influence on overall similarity, which is good, because people often mistype "Road" for "Street", etc, and it doesn't change the meaning much.
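And a hedged sketch of the MinHash-plus-LSH idea over such shingle sets; the number of hashes, bands and rows per band below are placeholders that would need tuning for the 10^8-record case:

```java
import java.util.Random;
import java.util.Set;

// Illustrative MinHash signature + LSH banding over a shingle set.
public class MinHashLsh {
    private final long[] seeds;
    private final int bands;
    private final int rowsPerBand;

    public MinHashLsh(int numHashes, int bands, long masterSeed) {
        if (numHashes % bands != 0) {
            throw new IllegalArgumentException("bands must divide numHashes");
        }
        this.seeds = new Random(masterSeed).longs(numHashes).toArray();
        this.bands = bands;
        this.rowsPerBand = numHashes / bands;
    }

    // Cheap seeded string hash; any well-mixed 64-bit hash would do.
    private long hash(String shingle, long seed) {
        long h = seed;
        for (int i = 0; i < shingle.length(); i++) {
            h = h * 1099511628211L + shingle.charAt(i);
        }
        return h;
    }

    // One minimum per hash function; similar shingle sets share many minima.
    public long[] signature(Set<String> shingles) {
        long[] sig = new long[seeds.length];
        for (int j = 0; j < seeds.length; j++) {
            long min = Long.MAX_VALUE;
            for (String s : shingles) {
                min = Math.min(min, hash(s, seeds[j]));
            }
            sig[j] = min;
        }
        return sig;
    }

    // LSH: records that share any band key become candidate pairs
    // for the expensive fine-grained matcher.
    public String[] bandKeys(long[] sig) {
        String[] keys = new String[bands];
        for (int b = 0; b < bands; b++) {
            StringBuilder key = new StringBuilder(b + ":");
            for (int r = 0; r < rowsPerBand; r++) {
                key.append(sig[b * rowsPerBand + r]).append('|');
            }
            keys[b] = key.toString();
        }
        return keys;
    }
}
```

Records that share at least one band key go to the fine-grained matcher; using more bands (fewer rows per band) makes the candidate filter more lenient, which should recover more of the pairs the current SimHash bucketing misses, at the cost of more candidate comparisons.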

Ben Whitmore

(I'd put the below in a comment but don't have the rep yet.)

What are you ultimately trying to do? Find all duplicates? How are you defining duplicates? Does case sensitivity matter? Does similar wording count?

I am slightly confused about how you are going about this: finding similar records and creating a set, but then later doing an O(n^2) check for what I assume is exact equality. If you are checking for exact equality, that seems to defeat the purpose of finding similar records (unless you are using the similarity grouping as a filter for your O(n^2) pass to save time).

Some random thoughts: Run each record through a sort of sanitizer that attempts to convert each record to the most general form (if you care / this matters).

If exact equality is what you are interested in, and memory is not a restriction, but you are looking for speed, you could just create a Java object for each record. Define the .equals() for each record (you could always customize this to not do exact equality). You will then need to define a hashCode() for this object. Then you can stick each record in a HashSet.

The resulting HashSet will have no duplicates (as defined by your .equals() / .hashCode() implementation).

Or, if you want to find the duplicates, check whether the HashSet already contains the record before you add it; if it does, you have found a duplicate.
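A minimal sketch of that exact-equality approach (the field names are illustrative; HashSet.add() returns false when an equal record is already present):

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Illustrative record type for exact-equality dedup via a HashSet.
public class CsvRecord {
    private final String name1;
    private final String name2;
    private final String address;

    public CsvRecord(String name1, String name2, String address) {
        this.name1 = name1;
        this.name2 = name2;
        this.address = address;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof CsvRecord)) return false;
        CsvRecord r = (CsvRecord) o;
        return name1.equals(r.name1) && name2.equals(r.name2) && address.equals(r.address);
    }

    @Override
    public int hashCode() {
        return Objects.hash(name1, name2, address);
    }

    public static void main(String[] args) {
        Set<CsvRecord> seen = new HashSet<>();
        CsvRecord a = new CsvRecord("JOHN SMITH", "JANE SMITH", "123 ANY STREET");
        CsvRecord b = new CsvRecord("JOHN SMITH", "JANE SMITH", "123 ANY STREET");
        System.out.println(seen.add(a));   // true  - first time seen
        System.out.println(seen.add(b));   // false - duplicate found
    }
}
```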

This implementation would be very fast, but could potentially use a LOT of memory, as you would be storing the entire data set in memory. An alternative would be to create a hash for each record, store that in the HashSet instead, and check the hashes for equality.

The downside of hashing each record is the challenge of devising a hash function with good distribution, and of course with hashes you have to worry about false positives from collisions. But if your hashing algorithm is solid, the chance of a collision should be so small that you shouldn't really worry about it.

As for the hash itself, you could do something as simple as an MD5 of the concatenation of all fields. You could use a checksum, or take the sum of the hashCode() of each field. I'm not a super math genius, so I can't tell you which would have the best distribution behavior and thus the lowest chance of collisions. It might be worth researching if you decide to go this route.
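For the MD5-of-concatenated-fields idea, a small sketch (the delimiter is only there so that field boundaries cannot shift and produce accidental collisions):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative: an MD5 fingerprint of the concatenated fields can stand in
// for the full record in a HashSet, trading memory for a tiny collision risk.
public class RecordFingerprint {
    public static String md5Key(String name1, String name2, String address)
            throws NoSuchAlgorithmException {
        // Delimiter prevents "AB" + "C" colliding with "A" + "BC".
        String joined = name1 + "\u0001" + name2 + "\u0001" + address;
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(joined.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```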

Harrison F
  • Please see my edits to the OP. Since I am not after exact equality, I don't think the equals()+hash() method will work. Also, 'sorting' poses a problem, since the equality relationship between three records is potentially NOT transitive. – banncee Nov 30 '11 at 15:43
  • I did not see your edit to the OP before my post. My apologies. How does your O(n^2) method work? – Harrison F Nov 30 '11 at 15:50
  • 1
    I made it after your post :) to answer what I thought would be common questions. The O(n^2) method uses a subset matcher (John Patrick Ryan = John Ryan) and a bi-gram match of the strings based on a Dice Coefficient threshold. – banncee Nov 30 '11 at 16:00
  • So I'm not sure if your O(n^2) is based on subset matching / bi-gram matching OR if your O(n^2) is in relation to comparing each record to every other record (thus O(n^2)). Subset matching / bi-gram matching is probably expensive regardless. String likeness is an interesting problem which I can provide no expertise on. One last thought: You could take your subset-matching / bi-gram matching and put that into the .equals(). Would still have to devise a good hashCode(). This would however guarantee accuracy, and it still should be reasonably fast as .equals() is only called on hash collisions. – Harrison F Nov 30 '11 at 16:26
  • The O(n^2) is because I have to compare every record to every other one. Within this compare I have to do a subset match and bi-gram (if necessary). – banncee Nov 30 '11 at 18:40