Efficient way to compare two lists of different lengths - Python

Question

I'm trying to compare list_A: 60 elements against list_B: ~300,000 elements, and return a count (in the form of a list) of the number of elements in list_A that appear for each of the elements in list_B.

My lists appear like:

list_A = ['CAT - cats are great', 'DOG - dogs are great too'] 
list_B = ['CAT - cats are great(A)DOG - dogs are great too(B)', 'DOG - dogs are great too(B)']

I would like my counts to return: [2, 1]

My implementation works, but it contains a nested for loop which leads to a long running time.

list = []
for i in range(len(list_B)):
    count = 0
    for j in range(len(list_A)):
        if (list_A[j] in list_B[i]):
            count += 1
    list.append(count)
return list

Any help would be much appreciated! Thanks :)

Convert `list_A` to a `set`, then `in` will be O(1) instead of O(n). — Barmar, Aug 28 '19 at 21:11
You're not creating a list of results, you're just counting everything all at once. — Barmar, Aug 28 '19 at 21:12
I don't think that's necessarily the case, if an element has an 'o' it shouldn't be counted in the resulting list. — Celius Stingher, Aug 28 '19 at 21:13
@CeliusStingher Yeah, his example doesn't include any letters that aren't in `list_A`. — Barmar, Aug 28 '19 at 21:14
You have lists of variable names; were these supposed to be strings? Your posted code fails to run due to undefined symbols, no call to the function, ... — Prune, Aug 28 '19 at 21:14
The `count` method or `collections.Counter` would be faster. Use `sum` over a comprehension on `list_A`, and you give the interpreter a good chance to speed up processing with vectorizations. — Prune, Aug 28 '19 at 21:16
Added a new answer that gives an example implementation. It is a bit long, but should be the fastest possible algorithm given the languge and generic constraints. It also allows counting multiple occurrences of the pattern, if desired. — Cireo, Aug 29 '19 at 06:22

Barmar · Accepted Answer · 2019-08-28T22:39:17.953

0

Since you're looking for substrings, I don't think there's any way to optimize it. You can use a list comprehension and sum() to simplify the code, though.

result = [sum(phrase in sentence for phrase in list_A) for sentence in list_B]

edited Aug 28 '19 at 22:39

answered Aug 28 '19 at 21:18

Barmar

741,623
53
500
612

Oops! I forgot to mention that the lists are lists of strings. Thanks Barmar! Your implementation works with those lists, but not with the lists that I have which is strange. My strings appear like: list_A = ['CAT - cats are great', 'DOG - dogs are great too'] list_B = ['CAT - cats are great(A)DOG - dogs are great too(B)', 'DOG - dogs are great too(B)'] And for some reason my results return all zeros. – Jessica Aug 28 '19 at 22:26
I assumed the first list is a list of letters, the second is a list of strings where you want to check each letter against the first. – Barmar Aug 28 '19 at 22:28
You need to edit the question to show realistic examples. – Barmar Aug 28 '19 at 22:29
Sorry for the confusion. I've edited my question with an example that's closer to what my data looks like, and I've changed my code as it was wrong. Thanks for your help! – Jessica Aug 28 '19 at 22:34
Thanks for your guidance, @Barmar. Much appreciated. – Jessica Aug 28 '19 at 22:55

score 0 · Answer 2 · answered Aug 28 '19 at 21:42

@Barmar's answer is fast and correct if you know list_A in advance, or only need to run it once. If that is not the case, you may consider the below as an alternative (which should also be fast, but has more steps).

import collections 

def count(target, summaries):
    return [sum(s[t] for t in target) for s in summaries]

mines = ['aa', 'ab', 'abc', 'aabc']
summaries = [collections.Counter(m) for m in mines]
gold = ['a', 'b']
silver = ['c']
assert count(gold, summaries) == [2, 2, 2, 3]
assert count(silver, summaries) == [0, 0, 1, 1]

It's also worth noting that if you are looking at 60 / 300000, there might be some speed-ups and simplifications that are missing from this toy example, e.g. if 60 is the numbers 1-60, or alphanumerics, etc. It might also be that the number of values that don't match is so small that it is easier to count those and remove it from the length.

She's changed the question, it's now matching arbitrary substrings, not individual characters. — Barmar, Aug 28 '19 at 22:41
Will put a different answer to a different question in a new answer — Cireo, Aug 29 '19 at 05:50

score 0 · Answer 3 · answered Aug 29 '19 at 06:18

I have actually answered a nearly identical question before, which can be found here: https://stackoverflow.com/a/55914487/2284490 The only difference is that you want to know len(matches) instead of any(matches) on the algorithm.

This can be solved efficiently as variations on the Aho Corasick algorithm

It is an efficient dictionary matching algorithm that locates patterns within text simultaneously in O(p + q + r), with p = length of patterns, q = length of text, r = length of returned matches.

You may want to run two separate state machines simultaneously, and you would need to modify them so they terminate on the first match.

I took a stab at the modifications, starting with this python implementation

class AhoNode(object):
    def __init__(self):
        self.goto = {}
        self.count = 0
        self.fail = None

def aho_create_forest(patterns):
    root = AhoNode()
    for path in patterns:
        node = root
        for symbol in path:
            node = node.goto.setdefault(symbol, AhoNode())
        node.count += 1
    return root

def aho_create_statemachine(patterns):
    root = aho_create_forest(patterns)
    queue = []
    for node in root.goto.itervalues():
        queue.append(node)
        node.fail = root
    while queue:
        rnode = queue.pop(0)
        for key, unode in rnode.goto.iteritems():
            queue.append(unode)
            fnode = rnode.fail
            while fnode is not None and key not in fnode.goto:
                fnode = fnode.fail
            unode.fail = fnode.goto[key] if fnode else root
            unode.count += unode.fail.count
    return root

def aho_count_all(s, root):
    total = 0
    node = root
    for i, c in enumerate(s):
        while node is not None and c not in node.goto:
            node = node.fail
        if node is None:
            node = root
            continue
        node = node.goto[c]
        total += node.count
    return total

def pattern_counter(patterns):
    ''' Returns an efficient counter function that takes a string
    and returns the number of patterns matched within it
    '''
    machine = aho_create_statemachine(patterns)
    def counter(text):
        return aho_count_all(text, machine)
    return counter

and to use it

patterns = ['CAT - cats are great', 'DOG - dogs are great too'] 
counter = pattern_counter(patterns)
text_list = ['CAT - cats are great(A)DOG - dogs are great too(B)',
             'DOG - dogs are great too(B)']
for text in text_list:
    print '%r - %s' % (text, counter(text))

This displays

'CAT - cats are great(A)DOG - dogs are great too(B)' - 2
'DOG - dogs are great too(B)' - 1

Note that this solution counts each match individually, so looking for 'a' and 'b' in 'aba' would give 3. If you wanted only one match per pattern, you would need to track all patterns seen, you would make a minor modification to convert the integer to a set:

- self.count = 0
+ self.seen = set()
...
- node.count += 1
+ node.seen.add(path)
...
- unode.count += unode.fail.count
+ unode.seen |= unode.fail.seen
...
- total = 0
+ all_seen = set()
- total += node.count
+ all_seen |= node.seen
- return total
+ return len(all_seen)

Efficient way to compare two lists of different lengths - Python

3 Answers3