efficient algorithm for comparing two lists

Question

I'm building a similarity matrix of a list of items. The naive approach is to iterate the list twice, but this needlessly will compare A:B and B:A when they're the same.

for A in items:
   for B in items:
      if A==B: continue
      sim[A][B] = calc_sim(A, B)

is there a simple way to only calculate half of the values? I could put a skip in there like

if sim[B][A]: continue # already calculated in other direction

But still the iteration is happening. Effectively I just want to iterate through the top or bottom half of the grid:

There are some similar Qs, but nothing with a canonical answer. This seems like a basic CS algo question!

the above is pseudo-code. i am writing python in this case tho yes but it's more about an algorithm language independent. I know i can use libraries like sklearn for NearestNeighbot but more interested in the raw algo for myself. added a python tag anyway — dcsan, Jan 02 '21 at 22:33
@superbrain I think you could be right! simplest is best haha. — dcsan, Jan 03 '21 at 06:08

score 3 · Answer 1 · answered Jan 02 '21 at 22:40

3

You could use itertools.combinations.

import itertools

for a, b in itertools.combinations(items, 2):
    sim[a][b] = sim[b][a] = calc_sim(a, b)

answered Jan 02 '21 at 22:40

abc

11,579
2
26
51

oh thanks for pointing this out. it's a bit python specific, but itertools is a great toolbox. – dcsan Jan 02 '21 at 22:57

mandulaj · Answer 2 · 2021-01-02T23:01:05.573

0

Assuming calc_sim(A, B) == calc_sim(B, A), you could try this:

for A in range(0, len(items)):
   for B in range(A, len(items)): # Replace with A+1 if you don't want the case A == B
      # Remember A and B are indexes, so change code accordingly
      result = calc_sim(items[A], items[B])
      sim[A][B] = result # Copy result to both A,B and B,A as they are equal
      sim[B][A] = result

However actually both algorithms are O(n) n²

edited Jan 02 '21 at 23:01

answered Jan 02 '21 at 22:39

mandulaj

733
3
10

1

Now `calc_sum` will get *indices*, not the actual *values* – trincot Jan 02 '21 at 22:43
Yeah, I made a note about it, but then @abc has a better answer.... https://stackoverflow.com/a/65544844/2180316 – mandulaj Jan 02 '21 at 22:47
`for B in range(A,` ... nice trick, this will effectively cut off one half of the mirror image. – dcsan Jan 02 '21 at 22:54
Its not very Pythonic, you should use the itertools method. – mandulaj Jan 02 '21 at 23:00

fdermishin · Answer 3 · 2021-01-02T22:57:31.780

0

If you need just a general algorithm to reduce number iterations, you can limit the range of the inner loop

for i, A in enumerate(items):
   for B in items[:i]:
      sim[A][B] = calc_sim(A, B)

But if you are looking for Python-specific optimization, it would be much better to use numpy vectorization. For example, if calc_sim(a, b) computes squared difference between a and b, then it can be vectorized the following way:

import numpy as np

list = [1, 2, 3]
array = np.array(list)
sim = np.square(array[:,np.newaxis] - array)

[[0 1 4]
 [1 0 1]
 [4 1 0]]

edited Jan 02 '21 at 22:57

answered Jan 02 '21 at 22:42

fdermishin

3,519
3
24
45

it's actually an array of word vectors I'm using to do the comparison. Can `np` be made to take any comparison function and apply it to a grid in that way, or just np built-ins like `np.square` ? – dcsan Jan 02 '21 at 22:59
@dcsan It can be done if the vectors have the same length. I think that this question may help: https://stackoverflow.com/questions/35215161/most-efficient-way-to-map-function-over-numpy-array But if the vectors have different lengths, than numpy is unlikely to be able to handle them, as long as they are not preprocessed in some way – fdermishin Jan 02 '21 at 23:06

efficient algorithm for comparing two lists

3 Answers3