
I have a list of 250,000 names. Suppose I use itertools.combinations to iterate over the unique pairs (my list is named content):

import itertools

for pair in itertools.combinations(content, 2):
    print(pair)

this will have length 250,000 choose 2, which is roughly 31 billion pairs. For each pair I want to compute a text similarity measure I've already written. A few questions:

a) should I use a yield statement along with the itertools method to create a generator, since I can't fit that many pairs in memory? At some point, though, to use my text similarity function, I'll need to save the output in memory or write it to a file, no? (A rough sketch of what I have in mind is below, after these questions.)

b) does this generalize to a more common problem that I'm just not aware of? Maybe a sparse matrix?

c) is there some pre-processing step that could reduce the size of the list?
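For (a), here's roughly the streaming approach I have in mind. similarity is a stand-in for my actual measure, and the 0.9 threshold and output file name are arbitrary placeholders:

import itertools

def similar_pairs(content, thresh):
    # Nothing is held in memory beyond the current pair.
    for lhs, rhs in itertools.combinations(content, 2):
        score = similarity(lhs, rhs)  # stand-in for my actual measure
        if score >= thresh:
            yield score, lhs, rhs

# Stream matches to disk instead of accumulating them in memory.
with open('matches.tsv', 'w') as out:
    for score, lhs, rhs in similar_pairs(content, 0.9):
        out.write('%s\t%s\t%s\n' % (score, lhs, rhs))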

Thanks!

Jeff

Jeffrey James
  • [xy](http://meta.stackexchange.com/questions/66377/what-is-the-xy-problem): *Why* are you finding the similarity between the pairs? What are you trying to solve? – Ami Tavory Jan 26 '16 at 19:57
  • trying to find similar values, then review high match ratios to see if they are variants of the same end-user – Jeffrey James Jan 26 '16 at 20:04
  • This is going to depend on your similarity metric. Sometimes you can optimize it to be [done in linear or near linear time](http://stackoverflow.com/q/23053688/572670), and some other cases - you cannot, and you actually need to compare all pairs. It entirely depends on the similarity function. – amit Jan 26 '16 at 20:12

2 Answers


From the clarification in your comment, it looks like you shouldn't print all the pairs unconditionally, but rather only those that pass some criterion. For example,

for lhs, rhs in itertools.combinations(content, 2):
    if dist(lhs, rhs) < thresh:
        print(lhs, rhs)

might reduce the size dramatically.

Note that it's easy to count the number of matching pairs with almost no memory:

sum((dist(lhs, rhs) < thresh) for lhs, rhs in itertools.combinations(content, 2))

This tells you how many such pairs there are before you commit to storing anything (though it still takes a full pass over all combinations).

If the number is too high, you'll probably want to randomly sample some of the pairs. You can either sample pairs uniformly, or uniformly sample lhs values and track all the rhs values that map to them, and so on.
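As an illustration of the first option, here is a minimal sketch of uniform pair sampling that never materializes the full set of combinations (sample_pairs, k, and the seed are hypothetical names, not part of the answer above):

import random

def sample_pairs(content, k, seed=0):
    # Uniformly sample k distinct unordered index pairs (i, j), i < j,
    # without enumerating all n-choose-2 combinations. Assumes k is
    # much smaller than n*(n-1)/2, so the loop terminates quickly.
    rng = random.Random(seed)
    n = len(content)
    seen = set()
    while len(seen) < k:
        i, j = rng.randrange(n), rng.randrange(n)
        if i != j:
            seen.add((min(i, j), max(i, j)))
    return [(content[i], content[j]) for i, j in seen]

Scoring such a sample also gives a cheap estimate of the match rate before committing to the full pass over all pairs.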

I think that's as far as it's possible to answer without knowing more about your setting.

Ami Tavory

I ended up using this: I created a generator so the full set of pairs is never read into memory

import itertools
from fuzzywuzzy import fuzz

def pair_iter(somelist, n=2):
    # Lazily yield (score, pair); the score is computed once per pair.
    for x in itertools.combinations(somelist, n):
        score = fuzz.partial_ratio(x[0], x[1])
        if score > 90 and abs(len(x[0]) - len(x[1])) < 3:
            yield score, x
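One possible way to drive this generator, streaming candidate matches to disk as they are found (the file name here is arbitrary):

with open('candidate_matches.tsv', 'w') as out:
    for score, (a, b) in pair_iter(content, 2):
        out.write('%d\t%s\t%s\n' % (score, a, b))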
Jeffrey James