Finding most common elements from large amount of lists

Question

I have a lot of lists, and I want to find group of most common elements that appear in the lists.

For example:

l1 = ['apple', 'banana', 'orange']
l2 = ['apple', 'banana', 'grape', 'lemon', 'orange']
l3 = ['banana', 'grape', 'kiwi']
l4 = ['apple', 'grape', 'kiwi', 'peach']
l5 = ['apple', 'blueberry', 'grape', 'kiwi', 'orange', 'pear']
l6 = ['chery', 'kiwi', 'pear']

How to find group of elements that appear in these lists, for example:

Group 1: ['apple', 'banana', 'orange'] in l1, l2, appear 2 times
Group 2: ['apple', 'grape', 'kiwi'] in l4, l5, appear 2 times
Group 3: ['apple', 'grape', 'orange'] in l2, l5, appear 2 times

Index of the element is not important. Group should have minimum 3 and maximum 5 elements. Lists can have from 3 to 10 elements.

I know that I can do something like this with intersections, but what if I have totally different list:

l7 = ["x", "y", "z", "k"]

Elements from this list are not appearing in any other list

jus to understand the qst, subgroups are not counted ? like `['apple', 'banana']` ? only the groups made from an actual list from l1, l2 ... ? — mrCopiCat, Sep 02 '22 at 11:32
group needs to have minimum 3 elements. And only the groups made from an actual list from l1, l2 — taga, Sep 02 '22 at 12:01
check my answer, in it you can specify the minimum you want, i choose 2 for the sake of the example. — mrCopiCat, Sep 02 '22 at 12:10

score 1 · Accepted Answer · answered Sep 02 '22 at 12:09

I think this might help, (I gave it a degree of freedom on the number of combinations to to extract and the minimum number of elements in each given combination :

import itertools

def get_max_rep(all_lists, min_comb_len, num_el_displayed):
    """Returns 'num_el_displayed' number of elements that repeat the most in the given lists"""
    # extract all the unique values of all the lists
    all_elements = set([el for l in all_lists for el in l])

    # build all the possible combinations starting from 'min_comb_len' number of elements
    combinations = [
        el for r in range(min(min_comb_len, len(all_elements)-1),len(all_elements)) 
        for el in itertools.combinations(all_elements, r)
        ]

    # count the number of repetitions of each combination in the given lists
    out = sorted(
        [(comb, sum([all(fruit in el  for fruit in comb) for el in all_lists])) 
        for comb in combinations], key=lambda x:x[1], reverse=True
        )[:num_el_displayed]

    return out

To test it out (here I want the first 5 combinations that have the most repetitions and that have a minimum of 2 elements:

# testing ...
l1 = ['apple', 'banana', 'orange']
l2 = ['apple', 'banana', 'grape', 'lemon', 'orange']
l3 = ['banana', 'grape', 'kiwi']
l4 = ['apple', 'grape', 'kiwi', 'peach']
l5 = ['apple', 'blueberry', 'grape', 'kiwi', 'orange', 'pear']
l6 = ['chery', 'kiwi', 'pear']

all_lists = [l1, l2, l3, l4, l5, l6]

print(get_max_rep(all_lists, 2, 5))

output:

[(('grape', 'kiwi'), 3), (('grape', 'apple'), 3), (('orange', 'apple'), 3), (('banana', 'grape'), 2), (('banana', 'orange'), 2)]

Works like a charm! Can you please write comments, because I'm not that good with itertools? — taga, Sep 02 '22 at 12:16
okay, so basically i look for all the possible combinations with the length of `r` in the elements of your list (without repetition), and that's what `itertoos.combinations` does, I also loop^over it to ginf all the combinations with all the lengths from the minimum specified to the maximum (which is the number of unique fruits in your case.), If you have a specific qst I'll be happy to answer it ! — mrCopiCat, Sep 02 '22 at 12:19
I just want to point out that instead of doing `set([list comprehension])` you can do directly set comprehension(e.g. `{a for a in lst}`), that's because the syntax is different for the set comprehension and the dictionary comprehension(e.g. `{k:v for k,v in dct}`). this causes to avoid function call and overhead — XxJames07-, Sep 02 '22 at 12:27
@XxJames07- ah yes you are right, did it on a rush so i didn't check my code again :), thanks for pointing that out — mrCopiCat, Sep 02 '22 at 12:29

score 0 · Answer 2 · answered Sep 02 '22 at 12:16

The problem you are facing is called Frequent Itemsets Mining. The 2 most popular algorithms (at least that I know) that solve this problem are:

They are very well explained in the book Data Mining. Concepts and Techniques (3rd Edition). A library that implements apriori is apyori, and this is how you could use it:

from apyori import apriori

transactions = [['apple', 'banana', 'orange'],
    ['apple', 'banana', 'grape', 'lemon', 'orange'],
    ['banana', 'grape', 'kiwi'],
    ['apple', 'grape', 'kiwi', 'peach'],
    ['apple', 'blueberry', 'grape', 'kiwi', 'orange', 'pear'],
    ['chery', 'kiwi', 'pear']]

results = list(map(lambda x: list(x.items), 
        filter(lambda x: len(x.items) >= 3, 
            apriori(transactions, min_support=2/len(transactions), max_length=5)
        )
    ))

print(results)

It returns:

[['banana', 'apple', 'orange'], ['grape', 'kiwi', 'apple'], ['grape', 'apple', 'orange']]

About the arguments of the apriori call:

min_support is the minimum frequency a certain itemset must have. In your case, it is 2/len(transactions) because the itemset must be present in at least 2 of the transactions
max_length is the maximum length of an itemset
I had to filter the result because otherwise I would get also the itemsets whose length is less than 3. (It is weird that there is no argument for the min length too, maybe I did not find it but the help(apyori.apriori) does not mention anything like that).

XxJames07- · Answer 3 · 2022-09-02T12:36:29.640

You could do something like this using a powerset of the combinations, obviously excluding the ones that are not in the range 3-5 lists.

Note: For the powerset function i give credit to this other answer from martineu:

Using itertools.chain.from_iterable and itertools.combinations to get the combinations.

 itertools.chain.from_iterable(iterable)

Alternate constructor for chain(). Gets chained inputs from a single iterable argument that is evaluated lazily

and using functools.reduce to do some workaround for the intersections, I have come up with this code that think that might help:

functools.reduce(function, iterable[, initializer])

from the docs:

Apply function of two arguments cumulatively to the items of iterable, from left to right, so as to reduce the iterable to a single value.

from itertools import chain, combinations
from functools import reduce
from pprint import pprint

lsts = [['apple', 'banana', 'orange'],
        ['apple', 'banana', 'grape', 'lemon', 'orange'],
        ['banana', 'grape', 'kiwi'],
        ['apple', 'grape', 'kiwi', 'peach'],
        ['apple', 'blueberry', 'grape', 'kiwi', 'orange', 'pear'],
        ['chery', 'kiwi', 'pear']]

lsts = tuple(map(set, lsts))


def powerset(iterable, start=0, stop=None): #from answer https://stackoverflow.com/a/40986475/13285707
    "powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
    s = list(iterable)  # allows duplicate elements
    if stop is None:
        stop = len(s)+1
    return chain.from_iterable(combinations(s, r) for r in range(start, stop))


groups = []
for x in powerset(lsts, 3, 5):
    if 3 < len(c:=reduce(set.intersection,x)) < 5:
        groups.append(c)


pprint(groups)

Output:

[{'apple', 'orange'}, {'apple', 'grape'}, {'grape', 'kiwi'}]

Shahab Rahnama · Answer 4 · 2022-09-02T12:38:47.460

My solutions:

l1 = ['apple', 'banana', 'orange']
l2 = ['apple', 'banana', 'grape', 'lemon', 'orange']
l3 = ['banana', 'grape', 'kiwi']
l4 = ['apple', 'grape', 'kiwi', 'peach']
l5 = ['apple', 'blueberry', 'grape', 'kiwi', 'orange', 'pear']
l6 = ['chery', 'kiwi', 'pear']

# return same elements of two lists
def sub_eq_lists(l1, l2):
    return list(set([x for x in l1 + l3 if x in l1 and x in  l2]))

all_lists = [l1, l2, l3, l4, l5, l6]
results = []
for j in all_lists:
    for l in all_lists:
        if(l!=j):
            same_elements = sub_eq_lists(j,l)
            if (len(same_elements) > 2) and (len(same_elements) < 11):# your conditions
   
                results.append(same_elements)
# count occurrences
results = {str(item): results.count(item) for item in results}
results

Output:

 {"['apple', 'orange', 'banana']": 2,
 "['apple', 'orange', 'grape']": 2,
 "['apple', 'kiwi', 'grape']": 2}

score 0 · Answer 5 · answered Sep 02 '22 at 12:55

You can use A.issubset(B) in order to check if elements of A are present in B or not, it will simply return False.


    def count1(grp=None, list_num=None):
        return grp.issubset(list_num)
        
    if __name__ == "__main__":
        g1 = {'apple', 'banana', 'orange'}
        g2 = {'apple', 'grape', 'kiwi'}
        g3 = {'apple', 'grape', 'orange'}
        
        l1 = ['apple', 'banana', 'orange']
        l2 = ['apple', 'banana', 'grape', 'lemon', 'orange']
        l3 = ['banana', 'grape', 'kiwi']
        l4 = ['apple', 'grape', 'kiwi', 'peach']
        l5 = ['apple', 'blueberry', 'grape', 'kiwi', 'orange', 'pear']
        l6 = ['chery', 'kiwi', 'pear']
        
        group_count = {}
        groups = [g1, g2, g3]
        list_check = [l1, l2, l3, l4, l5, l6]
        
        for index_list, list_num in enumerate(list_check):
            for index, grp in enumerate(groups):
                    if count1(grp, list_num):
                        group_count.setdefault(f'g{index + 1}',[]).append(f'l{index_list + 1}')
                        
        print(group_count)

Output : This what i understood from question.

{'g1': ['l1', 'l2'], 'g3': ['l2', 'l5'], 'g2': ['l4', 'l5']}

Finding most common elements from large amount of lists

5 Answers5