Finding common intersection between values of a column in a dataframe where values are arrays/lists

Question

I have a dataframe whose first 5 rows look like this.

 userID       CategoryID          sectorID
agunii2035  [16, 17, 3, 12, 1]  [2, 33, 29, 18, 23]
agunii3007  [2, 4, 6, 3, 16]    [4, 15, 29, 10, 18]
agunii2006  [8, 16, 2, 5, 12]   [38, 18, 7, 36, 33]
agunii2003  [6, 4, 2, 5, 17]    [37, 12, 3, 32, 34]
agunii3000  [12, 11, 7, 3, 1]   [38, 1, 13, 25, 3]

Now, for any userID (say "userID" = 'agunii2035'), I want to get the "userID"s whose "CategoryID" or "sectorID" share at least one value with it. (For example, since agunii2035 and agunii3007 have at least one common "CategoryID", i.e. 16, or one common "sectorID", i.e. 29, we would include the "userID" 'agunii3007'.)

The output can be a dataframe that looks like this

   userID         user_with_common_cat/sectorID
agunii2035      {agunii3007, agunii2006, agunii2003, agunii3000}
agunii3007      {agunii2035, agunii2006, agunii2003}

and so on

or this can also be

   userID         user_with_common_cat/sectorID
agunii2035      [agunii3007, agunii2006, agunii2003, agunii3000]
agunii3007      [agunii2035, agunii2006, agunii2003]

and so on

Any help on this please?

Edit

What I have done so far:

userID= 'agunii2035'

common_users = []

for user in uniqueUsers:
    common = list(set(df_interest.loc[df_interest['userID'] == userID, 'categoryID'].iloc[0]).intersection(df_interest.loc[df_interest['userID'] == user, 'categoryID'].iloc[0]))
                  
    #intersect = len(common) > 0
                  
    if (len(common) > 0):
            common_users.append(user)

I want to do this for sectors as well, taking the intersection for either sector or category, and append the user to the common_users list if either intersection contains at least one element.

Also, I want to do this for all the users.

I have just done a static one for a single user (the code is in the edit above). I want to add the sectors part as well and do this for all users. — d_b, Oct 16 '20 at 07:19

score 3 · Accepted Answer · answered Oct 16 '20 at 07:36

I usually don't like to manipulate a dataframe where a "cell" contains a list rather than a single element (float, str, etc.).

In the following I will manipulate a python dict instead of a dataframe.

Data

You can transform a pandas dataframe into a dict with the to_dict method (see the pandas documentation).
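As a minimal sketch of that conversion, assuming the dataframe has the columns shown in the question (userID, CategoryID, sectorID) and that user IDs are unique. The renamed keys category_id and sector_id and the variable name d_from_df are chosen here only to match the dictionary used in the rest of this answer:

```python
import pandas as pd

# Two sample rows with the column names from the question (assumed).
df = pd.DataFrame(
    {
        "userID": ["agunii2035", "agunii3007"],
        "CategoryID": [[16, 17, 3, 12, 1], [2, 4, 6, 3, 16]],
        "sectorID": [[2, 33, 29, 18, 23], [4, 15, 29, 10, 18]],
    }
)

# Index by user, rename the columns, and build {userID: {column: list}}.
d_from_df = (
    df.set_index("userID")
    .rename(columns={"CategoryID": "category_id", "sectorID": "sector_id"})
    .to_dict(orient="index")
)
```

With orient="index", each row becomes one entry keyed by its index value, which is why the userID column is moved into the index first.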

Here is the data as a dictionary:

d = {
        "agunii2035": {
            "category_id": [16, 17, 3, 12, 1],
            "sector_id": [2, 33, 29, 18, 23],
        },
        "agunii3007": {
            "category_id": [2, 4, 6, 3, 16],
            "sector_id": [4, 15, 29, 10, 18],
        },
        "agunii2006": {
            "category_id": [8, 16, 2, 5, 12],
            "sector_id": [38, 18, 7, 36, 33],
        },
        "agunii2003": {
            "category_id": [6, 4, 2, 5, 17],
            "sector_id": [37, 12, 3, 32, 34],
        },
        "agunii3000": {
            "category_id": [12, 11, 7, 3, 1],
            "sector_id": [38, 1, 13, 25, 3],
        },
    }

Solution 1: Iterate over the dictionary with for loops

Here we use two for loops to check all pairs of users. The only thing to know is how to intersect two lists in Python with set.

results = {}
for user_a, category_sector_a in d.items():
    results[user_a] = []
    for user_b, category_sector_b in d.items():
        if user_a != user_b:
            # we use "set" to have common elements between the two lists
            intersection_category = set(category_sector_a["category_id"]) & set(
                category_sector_b["category_id"]
            )
            intersection_sector = set(category_sector_a["sector_id"]) & set(
                category_sector_b["sector_id"]
            )
            if len(intersection_category) > 0 or len(intersection_sector) > 0:
                results[user_a].append(user_b)
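
To see the set operations in isolation, here is a small sketch using the category lists of the first two users. The helper name share_any is hypothetical and not part of the solution above; it relies on set.isdisjoint, which answers the yes/no question without building the intersection.

```python
categories_a = [16, 17, 3, 12, 1]
categories_b = [2, 4, 6, 3, 16]

# "&" keeps only the elements present in both sets.
common = set(categories_a) & set(categories_b)
print(sorted(common))  # [3, 16]


def share_any(list_a, list_b):
    """Return True if the two lists have at least one common element."""
    # Hypothetical helper: isdisjoint is True when nothing is shared.
    return not set(list_a).isdisjoint(list_b)


print(share_any(categories_a, categories_b))  # True
print(share_any([1, 2], [3, 4]))  # False
```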

Solution 2: itertools

Here, we use itertools to generate all combinations of keys in the original data. It will allow us to avoid the two for loops.

import itertools

results = {}
for user_a, user_b in itertools.combinations(d.keys(), 2):
    # we use "set" to have common elements between the two lists
    intersection_category = set(d[user_a]["category_id"]) & set(
        d[user_b]["category_id"]
    )
    intersection_sector = set(d[user_a]["sector_id"]) & set(d[user_b]["sector_id"])
    if len(intersection_category) > 0 or len(intersection_sector) > 0:
        if user_a in results:
            results[user_a].append(user_b)
        else:
            results[user_a] = [user_b]

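This is almost the same as the previous solution, except that we have to create the key in the results dictionary when it doesn't exist yet. Note that itertools.combinations yields each pair of users only once, so user_b is listed under user_a but not the other way round.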
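
If both directions are wanted, as in the output of Solution 1, one possible variant records each matching pair twice. This is a sketch rather than part of the original solution: the function name find_common_users, the sample data, and the use of collections.defaultdict are choices made here for illustration, and the input is assumed to have the same structure as d in the Data section.

```python
import itertools
from collections import defaultdict


def find_common_users(d):
    """Map each user to the users sharing a category or a sector with them."""
    results = defaultdict(list)
    for user_a, user_b in itertools.combinations(d, 2):
        same_category = set(d[user_a]["category_id"]) & set(d[user_b]["category_id"])
        same_sector = set(d[user_a]["sector_id"]) & set(d[user_b]["sector_id"])
        if same_category or same_sector:
            # combinations yields each pair once, so store both directions
            results[user_a].append(user_b)
            results[user_b].append(user_a)
    return dict(results)


# Made-up sample: u1 and u2 share category 2, u3 shares nothing.
sample = {
    "u1": {"category_id": [1, 2], "sector_id": [10]},
    "u2": {"category_id": [2, 3], "sector_id": [20]},
    "u3": {"category_id": [9], "sector_id": [30]},
}
print(find_common_users(sample))  # {'u1': ['u2'], 'u2': ['u1']}
```

Users without any match (u3 here) do not appear as keys in the result.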

Solution 3: itertools and list comprehension

Here, we also use itertools, but in a list comprehension. The list comprehension outputs only the user pairs that satisfy the condition (the if part).

import operator
import itertools

results = [
    (user_a, user_b)
    for user_a, user_b in itertools.combinations(d.keys(), 2)
    if (len(set(d[user_a]["category_id"]) & set(d[user_b]["category_id"]))) > 0
    or (len(set(d[user_a]["sector_id"]) & set(d[user_b]["sector_id"])) > 0)
]
results = {
    k: list(list(zip(*g))[1])
    for k, g in itertools.groupby(results, operator.itemgetter(0))
}

Note the part at the end where we need to group by the first user, because the output of the list comprehension is a list of user tuples (pairs). The way to group a list of tuples in Python comes from this solution on SO.
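
The grouping step can be tried on its own with a short sketch. The pairs below are made-up placeholders, and the inner list comprehension is an alternative spelling of the zip(*g) idiom used above. One thing worth keeping in mind is that itertools.groupby only merges consecutive items with the same key; that is fine here because itertools.combinations emits all pairs starting with the same user next to each other.

```python
import itertools
import operator

# Pairs as produced by the list comprehension: already grouped by first element.
pairs = [("a", "b"), ("a", "c"), ("b", "c")]

grouped = {
    first: [second for _, second in group]
    for first, group in itertools.groupby(pairs, key=operator.itemgetter(0))
}
print(grouped)  # {'a': ['b', 'c'], 'b': ['c']}
```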

Thanks a lot! I did the solution 1. — d_b, Oct 16 '20 at 11:04
