15

Let's say that you have a list of 10,000 email addresses, and you'd like to find some of the closest "neighbors" in this list - that is, email addresses that are suspiciously close to other addresses in your list.

I'm aware of how to calculate the Levenshtein distance between two strings (thanks to this question), which will give me a score of how many operations are needed to transform one string into another.

Let's say that I define "suspiciously close to another email address" as two strings having a Levenshtein score less than N.

Is there a more efficient way to find pairs of strings whose score is lower than this threshold besides comparing every possible string to every other possible string in the list? In other words, can this type of problem be solved quicker than O(n^2)?

Is Levenshtein score a poor choice of algorithm for this problem?

matt b
  • Take a peek at my response below; the problem is quite well known since it's basically spelling correction that you are after. – Matthieu M. Oct 23 '09 at 08:11
  • Please consider removing "accepted answer" status from Nick Johnson's answer, as its central claim of O(log n) time for approximate search is incorrect. – j_random_hacker Oct 23 '12 at 02:18

8 Answers

7

Yup - you can find all strings within a given distance of a string in O(log n) time by using a BK-Tree. Alternative solutions involving generating every string within distance N may be faster for a Levenshtein distance of 1, but the amount of work rapidly balloons out of control for longer distances.
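For illustration, here is a minimal BK-tree sketch in Python. The class and helper names (BKTree, add, query, and the bundled full-matrix levenshtein) are my own and not taken from the linked post; treat it as a sketch of the idea rather than a reference implementation. The query-then-insert loop at the end mirrors the pairing approach described in the comments below.

def levenshtein(a, b):
    """Plain full-matrix Levenshtein distance, included to keep the sketch self-contained."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution (or match)
        prev = cur
    return prev[-1]

class BKTree:
    def __init__(self):
        self.root = None  # each node is [word, {distance_to_child: child_node}]

    def add(self, word):
        if self.root is None:
            self.root = [word, {}]
            return
        node = self.root
        while True:
            d = levenshtein(word, node[0])
            child = node[1].get(d)
            if child is None:
                node[1][d] = [word, {}]
                return
            node = child

    def query(self, word, max_dist):
        """Return (distance, stored_word) for every stored word within max_dist of word."""
        results = []
        stack = [self.root] if self.root is not None else []
        while stack:
            node = stack.pop()
            d = levenshtein(word, node[0])
            if d <= max_dist:
                results.append((d, node[0]))
            # Triangle inequality: only children keyed d - max_dist .. d + max_dist can match.
            for key, child in node[1].items():
                if d - max_dist <= key <= d + max_dist:
                    stack.append(child)
        return results

# Finding all close pairs in a list: query first, then insert (see the comments below).
# tree = BKTree()
# for addr in addresses:
#     for dist, other in tree.query(addr, 2):
#         print(addr, other, dist)
#     tree.add(addr)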

Nick Johnson
  • Querying the pre-built tree is O(log n), but what about building AND querying the tree as you go through the list of emails? I don't think this is any good for the clustering problem. – Andriy Volkov Oct 24 '09 at 00:28
  • Well, he doesn't state explicitly what his runtime requirements are - but I'm presuming he repeatedly gets new candidate addresses, and has to check them, then insert them into the corpus if they pass. In any case, building the tree from scratch is O(n log n), still substantially better than O(n^2). :) – Nick Johnson Oct 24 '09 at 11:34
  • This looks useful but I'm not sure if I understand how to apply this to my situation. I'm looking to find all pairs of email addresses with distance < N (the problem isn't "find all strings within distance N of a certain email address"). Is the algorithm to build a tree based on my dictionary (10,000 email addresses), and then query the tree for each address? – matt b Oct 26 '09 at 13:40
  • Start with an empty tree. For each email address, first perform a lookup on the tree for the new email with the given distance, and output all the values within distance n. Then, insert the address into the tree. This is O(n log n) assuming the number of pairs is also roughly O(n log n). – Nick Johnson Oct 27 '09 at 21:35
  • Here is an excellent explanation for using bk-trees in approximate string matching: http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees – Pranav Nov 15 '09 at 19:12
  • Thanks Pranav, but that's my own blog post - and I linked to it in my answer. – Nick Johnson Nov 16 '09 at 10:35
  • Looks like a nice algorithm that might work very well most of the time, but it's not at all clear from the linked page that it can find all strings within a given L-distance in O(log n) time (i.e. independent of the distance d) -- for starters, what if the entire set of strings is close enough? Merely outputting them all means it must be at least O(n) then :-P In more depth though, O(log n) bounds usually apply to tree searches where a single child is visited at each level, but it seems searching BK trees can require many children to be examined. -1 until you explain better! – j_random_hacker Oct 03 '12 at 20:45
  • @j_random_hacker -1 because you don't understand the algorithm is a bit harsh. Yes, worst case runtime is O(n) when all nodes match - just as worst case time for quicksort is O(n^2), for instance. And the article definitely covers the relationship between the edit distance and the proportion of the tree searched - the followup post on levenshtein automata provides empirical examples. The blog post also links to the original papers on which it's based. – Nick Johnson Oct 04 '12 at 09:27
  • When you say "the article definitely covers the relationship between the edit distance and the proportion of the tree searched", do you mean B & K's original paper that you link to from your blog post? Because that paper is inaccessible to me, as is the "Fast Approximate String Matching in a Dictionary" article you also link to, and there is no claim, let alone justification, of O(log n) behaviour in your linked blog post. You acknowledge that my O(n) assessment for BK-trees is correct, but then blame *me* for not understanding? (And I'm not sure what quicksort has to do with anything.) – j_random_hacker Oct 04 '12 at 12:11
  • If you mean the algorithm is O(log n) in the *average case*, say so. This is a weaker but still very useful property (though I still don't understand how it could be independent of d unless the average is taken over some distribution of strings, which IME is usually far from the distribution encountered in practice). Ideally you could sketch a proof of the O(log n) claim either here or in a linked post, but at the very least say in the post where this O(log n) claim is proven. – j_random_hacker Oct 04 '12 at 12:17
  • Yes, the _worst_ case where you return every element is O(n). I would hope that was obvious to anyone, since you can't process n elements in less than O(n). But when only a fraction of the elements in the tree match, as is the normal case, the proportion of the tree you explore is O(log n) - which is fairly self evident when exploring fewer than all the branches of any tree. My blogpost isn't intended to be an exhaustive CS-theoretic exercise, it's an accessible description of it. If you want the proofs, I suggest finding a way to read the original papers. – Nick Johnson Oct 04 '12 at 14:18
  • Also, since when is something you view as an omission in an otherwise correct answer justification for voting it down? In my mind, a negative vote is equivalent to saying "this answer is a net negative". Do you believe that to be the case here? – Nick Johnson Oct 04 '12 at 14:18
  • First, the blog post itself is great. But yes, I regard an answer which makes an attention-grabbing yet incorrect claim without backing it up to be a net negative. (When you say "Algorithm XYZ is O(f(x))" without qualifying further, it's universally assumed that you're talking about a worst-case, deterministic time bound.) Please either weaken the O(log n) claim to something like "typically fast in practice", sketch a proof, or link to something that does. It will then be an excellent answer :) – j_random_hacker Oct 04 '12 at 16:50
  • It would even suffice to clean up and clarify the complexity description ("O(log n) in the average case, independent of distance") and say that this is proven in the B & K paper, if in fact it is. – j_random_hacker Oct 04 '12 at 16:52
  • I've now read both papers. The BK73 paper proves an O(log n) lookup time *only for "Class 0" (exact-match) queries*. They don't attempt to analyse approximate-match queries, only giving some experimental results. The B-YN98 paper gives an analysis in its appendix showing that for non-exact queries, BK-trees are O(n^a) for some 0 < a < 1 which is a function of the expected distance between any pair of objects. In short, there is no evidence, either theoretical or empirical, for your claim that BK-trees find nearest neighbours at arbitrary distance in O(log n) time, so please delete it. – j_random_hacker Oct 04 '12 at 18:02
6

This problem is known as clustering and is a part of a bigger deduplication problem (where you get to decide which member of the cluster is "the right" one), also known as merge-purge.

I once read a few research papers on exactly this topic (the names are below) and basically, the authors used a limited-size sliding window over a sorted list of strings. They would only compare (using an edit distance algorithm) N*N strings inside the window, thereby reducing the computational complexity. If any two strings looked similar, they were combined into a cluster (by inserting a record into a separate cluster table).

The first pass through the list was followed by a second pass where the strings were reversed before getting sorted. This way the strings with different heads had another chance to get close enough to be evaluated as part of the same window. On this second pass, if a string looked close enough to two (or more) strings in the window, and those strings were already part of their own clusters (found by the first pass), the two clusters would then be merged (by updating the cluster table) and the current string would be added to the newly merged cluster. The cluster merging is handled with the union-find algorithm.
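As a rough illustration of one such pass (a simplification, not the papers' exact method), here is a sorted-neighborhood sketch in Python with union-find merging. The window size, threshold, and function names are illustrative, and it assumes the third-party python-Levenshtein package for the edit distance.

from Levenshtein import distance as levenshtein  # pip install python-Levenshtein

def find(parent, x):
    """Union-find 'find' with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(parent, a, b):
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[rb] = ra

def cluster_pass(addresses, parent, window=20, threshold=3, reverse_strings=False):
    keys = [a[::-1] if reverse_strings else a for a in addresses]
    order = sorted(range(len(addresses)), key=lambda i: keys[i])
    for pos, i in enumerate(order):
        # Compare only against the previous `window` strings in sorted order.
        for j in order[max(0, pos - window):pos]:
            if levenshtein(addresses[i], addresses[j]) < threshold:
                union(parent, i, j)

# parent = {i: i for i in range(len(addresses))}
# cluster_pass(addresses, parent)                        # pass 1: sorted as-is
# cluster_pass(addresses, parent, reverse_strings=True)  # pass 2: sorted on reversed strings
# Addresses with the same find(parent, i) end up in the same cluster.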

Then they improved the algorithm by replacing the window with a list of the top X substantially unique prototypes. Each new string would be compared to each of the top X prototypes. If the string looked close enough to one of the prototypes, it would then be added to the prototype's cluster. If none of the prototypes looked similar enough, the string would become a new prototype, pushing the oldest prototype out of the top X list. (Heuristic logic was employed to decide which of the strings in the prototype's cluster should be used as the new prototype representing the entire cluster.) Again, if the string looked similar to several prototypes, all of their clusters would be merged.

I once implemented this algorithm for deduplication of name/address records with sizes of the lists being around 10-50 million records and it worked pretty damn fast (and found duplicates well too).

Overall for such problems, the trickiest part of course is finding the right value of the similarity threshold. The idea is to capture all the dups w/o producing too many false positives. Data with different characteristics tends to require different thresholds. The choice of an edit-distance algorithm is also important as some algorithms are better for OCR errors while others are better for typos and yet others are better for phonetic errors (such as when getting a name over the phone).

Once the clustering algorithm is implemented, a good way to test it is to get a list of unique samples and artificially mutate each sample to produce its variations, while preserving the fact that all the variations have come from the same parent. This list is then shuffled and fed to the algorithm. Comparing the original clustering with the clustering produced by the deduplication algorithm will give you the efficiency score.
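A tiny sketch of that evaluation setup, with made-up single-character mutation operators (the mutations you actually want depend on the error model you care about, e.g. typos vs. OCR errors):

import random
import string

def mutate(s, n_edits=1):
    """Apply n_edits random single-character edits to a non-empty string."""
    for _ in range(n_edits):
        op = random.choice(("insert", "delete", "replace"))
        c = random.choice(string.ascii_lowercase)
        if op == "insert":
            i = random.randrange(len(s) + 1)
            s = s[:i] + c + s[i:]
        elif op == "delete" and len(s) > 1:
            i = random.randrange(len(s))
            s = s[:i] + s[i + 1:]
        else:  # replace
            i = random.randrange(len(s))
            s = s[:i] + c + s[i + 1:]
    return s

def make_test_set(samples, variations=3):
    """Return shuffled (mutated, parent) pairs; the parent column is the ground truth."""
    rows = [(s, s) for s in samples]
    for s in samples:
        rows += [(mutate(s, random.randint(1, 2)), s) for _ in range(variations)]
    random.shuffle(rows)
    return rows

Feeding only the first column to the deduplication algorithm and comparing its clusters against the parent column gives the efficiency score described above.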

Bibliography:

Hernandez M. 1995, The Merge/Purge Problem for Large Databases.

Monge A. 1997, An Efficient Domain-Independent Algorithm for Detecting Approximately Duplicate Database Records.

Andriy Volkov
5

You can do it with Levenshtein in O(kl), where k is your maximum distance and l is the maximum string length.

Basically, once you know how to calculate basic Levenshtein, it's easy to see that every cell further than k from the main diagonal has to be bigger than k. So calculating only a band of width 2k + 1 around the main diagonal will suffice.
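A minimal sketch of that banded computation, assuming the goal is only to know whether the distance is at most k (the function name and the early-exit details are mine, not the answer's):

def levenshtein_banded(a, b, k):
    """Return the exact Levenshtein distance if it is <= k, otherwise k + 1."""
    if abs(len(a) - len(b)) > k:
        return k + 1
    INF = k + 1
    prev = [j if j <= k else INF for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        lo, hi = max(1, i - k), min(len(b), i + k)   # the band of width 2k + 1
        cur = [INF] * (len(b) + 1)
        if lo == 1:
            cur[0] = i if i <= k else INF
        for j in range(lo, hi + 1):
            cost = a[i - 1] != b[j - 1]
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
        if min(cur[lo:hi + 1]) > k:                  # whole band already exceeds k: give up
            return k + 1
        prev = cur
    return prev[len(b)] if prev[len(b)] <= k else k + 1

For example, levenshtein_banded("example@gmail.com", "exmaple@gmail.com", 2) returns 2, while any pair whose distance exceeds k simply comes back as k + 1.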

If you have 10,000 email addresses you won't need a faster algorithm. A computer can do the O(N^2) comparison fast enough.

Levenshtein is quite good for this kind of problem.

You might also consider transforming the emails with Soundex before comparing. You'll probably get better results.

Egon
  • Can you define what you mean by "main diagonal"? – matt b Oct 22 '09 at 20:55
  • You've got a matrix for calculating the Levenshtein distance. Its dimensions are something like `m*n`, so the main diagonal would be (i,i) where 0 <= i < min(m,n) – Egon Oct 22 '09 at 21:03
  • Think of it this way: in the matrix for calculating the Levenshtein distance, moving sideways or up-down corresponds to a deletion or insertion, which means that if the best alignment comes from that path it adds `1` to the distance. So if you go more than `k` steps away from the diagonal, the distance won't be better than `k`. – Egon Oct 22 '09 at 21:08
  • You can also speed up the comparison with an automaton. Constructing an automaton that recognizes a word within maximal distance k will take O(nk) time, but checking whether a word matches will take O(n). So if you have N emails it'll take O(N*n*k) to get all the automatons built, and then O(N*N*n) to check each email against each. – Egon Oct 22 '09 at 21:37
  • Thanks, this is helpful. If I understand you correctly, this means that for a threshold=3, I shouldn't bother to calculate any scores for strings that have a length +/- 3 from the string being tested - should be a nice optimization. – matt b Oct 23 '09 at 13:10
  • http://img101.imageshack.us/img101/7960/screenshotj.png See that picture. With threshold 3 you won't need to calculate the red parts. – Egon Oct 23 '09 at 13:59
2

I don't think you can do better than O(n^2) but you can do some smaller optimizations which could be just enough of a speedup to make your application usable:

  • You could first sort all email addresses by the part after the @ and only compare addresses where that part is the same (see the sketch after this list)
  • You can stop calculating the distance between two addresses as soon as it exceeds N
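A minimal sketch of the first optimization (bucket by domain, then compare only within each bucket); it assumes the third-party python-Levenshtein package, and the second optimization could be realized by swapping in a threshold-aware distance such as the banded one sketched in the answer above:

from collections import defaultdict
from Levenshtein import distance  # pip install python-Levenshtein

def close_pairs(addresses, threshold):
    """Return all pairs of addresses on the same domain with distance < threshold."""
    by_domain = defaultdict(list)
    for addr in addresses:
        local, _, domain = addr.rpartition("@")
        by_domain[domain.lower()].append(addr)
    pairs = []
    for bucket in by_domain.values():
        for i in range(len(bucket)):
            for j in range(i + 1, len(bucket)):
                if distance(bucket[i], bucket[j]) < threshold:
                    pairs.append((bucket[i], bucket[j]))
    return pairs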

EDIT: Actually you can do better than O(n^2), just look at Nick Johnson's answer below.

Gabb0
1

Let's say you have 3 strings:

1 - "abc"
2 - "bcd"
3 - "cde"

The L Distance between 1 & 2 is 2 (subtract 'a', add 'd'). The L Distance between 2 & 3 is 2 (subtract 'b', add 'e').

Your question is whether we can infer an L Distance between 1 & 3 by using the 2 comparisons above. The answer is no.

The L Distance between 1 & 3 is 3 (replace every character); there is no way that this can be inferred from the scores of the first 2 calculations. The scores do not reveal whether deletion, insertion or substitution operations were performed.
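The numbers are easy to check with any Levenshtein implementation; a compact one is included here only to make the snippet self-contained:

def lev(a, b):
    row = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        new = [i]
        for j, cb in enumerate(b, 1):
            new.append(min(row[j] + 1, new[j - 1] + 1, row[j - 1] + (ca != cb)))
        row = new
    return row[-1]

print(lev("abc", "bcd"), lev("bcd", "cde"), lev("abc", "cde"))  # 2 2 3
# The triangle inequality only bounds lev("abc", "cde") by 2 + 2 = 4; the exact
# value (3 here) cannot be recovered from the two pairwise scores alone.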

So, I'd say that Levenshtein is a poor choice for a large list.

Neil Kimber
  • Yes, but by the nature of the metric, you can at least assume that the distance is 2+2 or less. This could be valuable information if you have a list of strings where the interesting distances are much smaller than the lengths of the strings. – jprete Oct 22 '09 at 20:48
1

10,000 email addresses doesn't sound like too many. For similarity search in a larger space you can use shingling and min-hashing. This algorithm is a bit more complicated to implement, but is much more efficient on a large space.
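A minimal shingling + MinHash sketch in Python; the shingle length, signature size, and hash construction are illustrative choices, not a recommendation:

import random

def shingles(s, n=3):
    """Set of character n-grams ("shingles") of the string."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def make_hashers(count, seed=42):
    """A cheap family of hash functions: Python's hash XORed with random masks."""
    rnd = random.Random(seed)
    masks = [rnd.getrandbits(64) for _ in range(count)]
    return [lambda x, m=m: hash(x) ^ m for m in masks]

def minhash_signature(items, hashers):
    return tuple(min(h(x) for x in items) for h in hashers)

def estimated_similarity(sig_a, sig_b):
    # The fraction of matching MinHash slots approximates the Jaccard
    # similarity of the two shingle sets.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

hashers = make_hashers(64)
a = minhash_signature(shingles("john.smith@example.com"), hashers)
b = minhash_signature(shingles("jon.smith@example.com"), hashers)
print(estimated_similarity(a, b))  # high value -> candidate pair for an exact check

In a real setup you would then group signatures with locality-sensitive hashing (banding), so that only candidate pairs with high estimated similarity get an exact edit-distance comparison.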

Yuval F
  • I kind of picked 10,000 out of thin air, hoping it sounded "large", the real problem space is a few times larger (but not in the millions) – matt b Oct 22 '09 at 20:45
  • @Yuval, both approaches you mentioned are used to compare two big documents while this question is about clustering large numbers of small strings. Different problems altogether. – Andriy Volkov Oct 24 '09 at 00:09
  • @zvolkov - I heard a lecture about these approaches. It is debatable whether they fit small documents. They definitely are meant for finding similar documents among a very large set. – Yuval F Oct 27 '09 at 10:46
  • @zvolkov: "Shingling" (I know it as n-grams/k-tuples) is a great way to find groups of strings that are likely to have low L-distance. 1) Split each address into n-grams. 2) Build an array indexed by n-gram (so it will have e.g. 26^4 entries if n=4 and allowed characters are A-Z) where each entry contains a list of all addresses containing that n-gram (addresses being identified by integers). 3) For each such list, say of length k, write out k*(k-1)/2 integer pairs -- 1 for each pair of integers sharing that n-gram. 4) Sort this list of pairs. 5) # of same pair in a run of dupes ≈ similarity. – j_random_hacker Oct 03 '12 at 20:21
  • @j_random_hacker I see. So shingling _can_ be used for clustering, huh. Won't scale though. – Andriy Volkov Oct 03 '12 at 21:03
  • @zvolkov: You sure about that? :-P It's how many bioinformatics programs find matching pairs of DNA sequences. It even scales to datasets that don't fit in memory, because every step (even "building the array") can be done using sorting, and there are pretty efficient external sorting algorithms. The trick is to make n as large as possible so that the list in each array element is small. (But having n too large means some similarities will be missed.) – j_random_hacker Oct 03 '12 at 21:24
1

If you really are comparing email addresses then one obvious way to do this would be to combine a Levenshtein algorithm with domain mapping. I can think of times when I've signed up for something multiple times using the same domain, but with variations on the username portion of the email address.

spig
1

It's possible to do better, on the condition of reversing the problem.

I assume here that your 10,000 addresses are pretty 'fixed', otherwise you will have to add an update mechanism.

The idea is to use the Levenshtein distance, but in 'reverse' mode, in Python:

class Addresses:
    def __init__(self, addresses):
        self.rep = dict()
        # Level 0: a simple dictionary which associates each address with itself
        self.rep[0] = self.generate_base(addresses)

        self.rep[1] = self.generate_level(1)
        self.rep[2] = self.generate_level(2)
        # ... and so on up to N

The generate_level method generates all possible variations from the previous set, minus the variations that already exist at a previous level. It preserves the 'origin' as the value associated with the key.
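The answer doesn't show those two methods; below is one possible sketch in the spirit of Norvig's edits1, written as free functions over the rep dictionary (the alphabet and all the bodies here are my own guesses, and hooking them up as methods of Addresses is straightforward):

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789.-_@"

def edits1(word):
    """All strings one insertion, deletion or substitution away (Norvig-style)."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    replaces = [l + c + r[1:] for l, r in splits if r for c in ALPHABET]
    inserts = [l + c + r for l, r in splits for c in ALPHABET]
    return set(deletes + replaces + inserts)

def generate_base(addresses):
    return {a: a for a in addresses}

def generate_level(rep, level):
    """All variations of the level-1 set, skipping anything seen at lower levels."""
    seen = set()
    for d in range(level):
        seen.update(rep[d])
    variations = {}
    for variant, origin in rep[level - 1].items():
        for v in edits1(variant):
            if v not in seen and v not in variations:
                variations[v] = origin
    return variations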

Then, you just have to look up your word in the various sets:

    def getAddress(self, address):
        for index in sorted(self.rep.keys()):
            if address in self.rep[index]:
                return (index, self.rep[index][address])  # tuple (distance, origin)
        return None

Doing so, you compute the various sets once (it takes some time... but then you can serialize it and keep it forever).

And then lookup is much more efficient than O(n^2), though giving an exact bound is difficult since it depends on the size of the sets that are generated.

For reference, have a look at: http://norvig.com/spell-correct.html

Matthieu M.
  • This sounds like a pretty sound approach, except that I expect my dataset to be changing each day; so I'm curious if having to regenerate all of the variations each time costs me some efficiency. Thank you for the idea though and the Norvig link is always appreciated! – matt b Oct 23 '09 at 13:52
  • The generation can be made by an offline process, then serialized and loaded on demand. – Matthieu M. Oct 23 '09 at 16:55
  • This won't work for huge lists (tens of millions of records). For that, see my answer. – Andriy Volkov Oct 24 '09 at 00:12
  • Actually, it will. Norvig fed his algorithm a 6.2 MB file of words to 'train' it. But I think we are talking about different problems: I assume that the set of 'right' answers is known beforehand and you just try to 'group' the various addresses together. – Matthieu M. Oct 24 '09 at 15:15