
I have somewhat resolved this issue and I'm just trying to figure out a more efficient way of doing it. I have a big list of lists and I am trying to compare each list in the big list with every other one.

How can I avoid duplicate comparisons, comparing lists that have already been compared?

Ex: big_list[0] has been compared with big_list[20], so there is no reason to compare big_list[20] with big_list[0] later on in the loop.

    big_list = [[0.12, 0.939, -0.321, 6.342], [0.12, 0.939, -0.321, 6.342],
                [0.0, 1.0, -0.0, -5.166], [0.0, 1.0, 0.0, -5.166],
                [0.0, 1.0, -0.0, -5.166], [-0.0, 1.0, 0.0, -5.166],
                [0.0, 1.0, 0.0, -5.166], [0.0, 1.0, 0.0, -5.166],
                [0.0, 1.0, -0.0, -5.166], [0.0, 1.0, 0.0, -5.166],
                [-0.0, 1.0, -0.0, -5.166], [-0.0, 1.0, 0.0, -5.166],
                [-0.12, 0.939, 0.321, 0.282], [-0.12, 0.939, 0.321, 0.282],
                [0.12, 0.939, 0.321, -17.782], [0.12, 0.939, 0.321, -17.782],
                [-0.0, 1.0, 0.0, 0.834], [0.0, 1.0, 0.0, 0.834],
                [0.0, 1.0, 0.0, 0.834], [0.0, 1.0, 0.0, 0.834],
                [-0.12, 0.939, -0.321, 24.406], [-0.12, 0.939, -0.321, 24.406],
                [0.0, 0.874, -0.486, 21.883], [0.0, 0.874, -0.486, 21.883],
                [0.0, 0.874, 0.486, -14.598], [0.0, 0.874, 0.486, -14.598]]

    import math

    for j in range(len(big_list)):
        for k in range(len(big_list)):
            if j != k:
                result = math.sqrt(sum([(a - b) ** 2 for a, b in zip(big_list[j], big_list[k])]))

Previously, I resolved the matter by setting a specific tolerance and appending each matching sublist to a new list, but I am trying to come up with a more efficient way of doing this. Eventually, big_list will probably have over a million lists:

    if result <= rel_tol and big_list[k] not in new_list:
        new_list.append(big_list[k])
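Putting those pieces together, my previous approach looked roughly like this (the rel_tol value below is just a placeholder, not the tolerance I actually used):

    import math

    rel_tol = 0.001  # placeholder tolerance
    new_list = []
    for j in range(len(big_list)):
        for k in range(len(big_list)):
            if j != k:
                result = math.sqrt(sum([(a - b) ** 2 for a, b in zip(big_list[j], big_list[k])]))
                # collect big_list[k] the first time it falls within the tolerance of another sublist
                if result <= rel_tol and big_list[k] not in new_list:
                    new_list.append(big_list[k])
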
webmaker

3 Answers


Instead of doing:

    for j in range(len(big_list)):
        for k in range(len(big_list)):

do this (note the j+1):

    for j in range(len(big_list)):
        for k in range(j + 1, len(big_list)):

This way your inner loop skips over all of the indexes you've already looked at, avoiding duplicate comparisons.
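
Applied to the distance computation from your question, that looks roughly like this (just a sketch of the same idea):

    import math

    for j in range(len(big_list)):
        # start the inner loop at j + 1 so each pair is compared exactly once
        for k in range(j + 1, len(big_list)):
            result = math.sqrt(sum([(a - b) ** 2 for a, b in zip(big_list[j], big_list[k])]))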

Justin Buchanan

What @Justin replied was also my first thought, but after reflection I don't believe that's the most efficient approach for a really big big_list. Instead, I'd use a set() of tuples:

    tupled_set = set([tuple(i) for i in big_list])
    new_list_of_tuples = list(tupled_set)

Shorter and faster, using only built-in iteration. Of course, if you need to go back to lists (because for some reason you need mutable sublists) the performance gain may be lost (not sure, to be honest; it would need benchmarking), but the readability isn't:

    list_of_lists = [list(i) for i in new_list_of_tuples]
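For example, with a few rows taken from your big_list (note that the resulting order is whatever the set happens to produce, see the comments below):

    sample = [[0.12, 0.939, -0.321, 6.342], [0.12, 0.939, -0.321, 6.342], [0.0, 1.0, -0.0, -5.166]]
    tupled_set = set([tuple(i) for i in sample])
    list_of_lists = [list(i) for i in tupled_set]
    # the two identical rows collapse into one, so two unique rows remain;
    # their order in list_of_lists is not guaranteed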

Cheers

smassey
  • 5,875
  • 24
  • 37
  • This wouldn't necessarily result in the same ordering as the original code; depending on the context this might not matter though. – niemmi May 03 '16 at 15:48
  • 1
  • @niemmi valid point! I had to google the stability of sets in Python to be sure and learned something new today: http://stackoverflow.com/questions/3812429/is-pythons-set-stable Python's sets are in fact stable (no future guarantee, they just happen to be that way today in CPython's implementation). – smassey May 03 '16 at 15:53
  • If I understood correctly, that was about how iterating multiple times over a set that hasn't been changed in between returns the items in the same order. That order isn't necessarily the same as insertion order. For example I get the following on my machine: `list(set(range(5, 0, -1)))` -> `[1, 2, 3, 4, 5]`. – niemmi May 03 '16 at 16:15
  • Totally correct again, I misunderstood the example code. Cheers! – smassey May 03 '16 at 16:34

Instead of comparing the lists against each other with two for loops, you could convert them to tuples and use a Counter to see how many instances of each there are. Then you could iterate over the list and pick the first occurrence of every sublist that has multiple instances.

    from collections import Counter

    c = Counter(tuple(l) for l in big_list)
    new_list = []
    for l in big_list:
        t = tuple(l)
        if c[t] > 1:
            new_list.append(l)
            c[t] = 0

This has O(n) time complexity and it will result in the same ordering as the original code.
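
For example, with a small made-up sample (not the data from the question), only the first occurrence of each duplicated sublist is kept, in the original order:

    from collections import Counter

    sample = [[1.0, 2.0], [3.0, 4.0], [1.0, 2.0], [5.0, 6.0], [3.0, 4.0]]
    c = Counter(tuple(l) for l in sample)
    new_list = []
    for l in sample:
        t = tuple(l)
        if c[t] > 1:           # only sublists that occur more than once
            new_list.append(l)
            c[t] = 0           # zero the count so later occurrences are skipped
    # new_list == [[1.0, 2.0], [3.0, 4.0]]; [5.0, 6.0] has no duplicate, so it is skipped
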

niemmi