Python: how to identify common elements in lists from two dataframes' series

Question

Using Pandas, I have two data sets stored in two separate dataframes. Each dataframe is composed of two series.

In the first dataframe, the first series is called 'name' and the second series, 'attributes', holds a list of strings in each row. It looks something like this:

                  name                           attributes
0                 John  [ABC, DEF, GHI, JKL, MNO, PQR, STU]
1                 Mike  [EUD, DBS, QMD, ABC, GHI]
2                 Jane  [JKL, EJD, MDE, MNO, DEF, ABC]
3                Kevin  [FHE, EUD, GHI, MNO, ABC, AUE, HSG, PEO]
4             Stefanie  [STU, EJD, DUE]

The second dataframe is similar, with the first series being 'username' and the second 'attr':

              username                                 attr
0           username_1  [DHD, EOA, AUE, CHE, ABC, PQR, QJF]
1           username_2  [ABC, EKR, ADT, GHI, JKL, EJD, MNO, MDE]
2           username_3  [DSB, AOD, DEF, MNO, DEF, ABC, TAE]
3           username_4  [DJH, EUD, GHI, MNO, ABC, FHE]
4           username_5  [CHQ, ELT, ABC, DEF, GHI]

What I'm trying to achieve is to compare the attributes (second series) of each dataframe to see which names and usernames share the most attributes.

For example, username_4 has 5 out of 6 attributes matching Kevin's.

I thought of looping over one of the attributes series and checking for a match in each row of the other series, but I couldn't get the loop to work effectively (maybe because my lists don't have quotation marks around the strings?).

I don't really know what options exist for comparing those two series and ending up with a result like the one mentioned above (username_4 has 5 out of 6 attributes matching Kevin's).

What would be the possible approach(es) here?

score 0 · Answer 1 · answered Nov 14 '22 at 08:40

You could try an approach like the one below:

# Import pandas library
import pandas as pd

# Create our data frames
data1 = [
    ['John', ['ABC', 'DEF', 'GHI', 'JKL', 'MNO', 'PQR', 'STU']],
    ['Mike', ['EUD', 'DBS', 'QMD', 'ABC', 'GHI']],
    ['Jane', ['JKL', 'EJD', 'MDE', 'MNO', 'DEF', 'ABC']],
    ['Kevin', ['FHE', 'EUD', 'GHI', 'MNO', 'ABC', 'AUE', 'HSG', 'PEO']],
    ['Stefanie', ['STU', 'EJD', 'DUE']],
]

data2 = [
    ['username_1', ['DHD', 'EOA', 'AUE', 'CHE', 'ABC', 'PQR', 'QJF']],
    ['username_2', ['ABC', 'EKR', 'ADT', 'GHI', 'JKL', 'EJD', 'MNO', 'MDE']],
    ['username_3', ['DSB', 'AOD', 'DEF', 'MNO', 'DEF', 'ABC', 'TAE']],
    ['username_4', ['DJH', 'EUD', 'GHI', 'MNO', 'ABC', 'FHE']],
    ['username_5', ['CHQ', 'ELT', 'ABC', 'DEF', 'GHI']],
]
  
# Create the pandas DataFrames, providing the column names explicitly
df1 = pd.DataFrame(data1, columns=['name', 'attributes'])
df2 = pd.DataFrame(data2, columns=['username', 'attr'])

# Create helper function to compare our two data frames
def func(inputDataFrame2, inputDataFrame1):
    outputDictionary = {} # Set a dictionary for our output
    for i, r in inputDataFrame2.iterrows(): # Loop over items in second data frame
        dictBuilder = {}
        for index, row in inputDataFrame1.iterrows(): # Loop over items in first data frame
            name = row['name']
            dictBuilder[name] = len([w for w in r['attr'] if w in row['attributes']]) # Count items of r['attr'] that also appear in row['attributes'] (repeated items count each time)
        maxKey = max(dictBuilder, key=dictBuilder.get) # Get the name with the highest count (the first name wins on ties)
        outputDictionary[r['username']] = [maxKey, dictBuilder[maxKey]] # Add name and count of attribute matches to dictionary
    print(outputDictionary) # Debug print statement
    return outputDictionary # Return our output dictionary here for further processing


a = func(df2, df1)

That should yield output like the following:

{'username_1': ['John', 2], 'username_2': ['Jane', 5], 'username_3': ['John', 4], 'username_4': ['Kevin', 5], 'username_5': ['John', 3]}

Each item in the dictionary returned by func has:

A key equal to the username from the second data frame
A value equal to a list containing the name from the first data frame with the most matches, and the count of those matches
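
One detail in the output above is worth a second look: username_3 lists DEF twice, and the list comprehension counts it both times, which is why its score against John is 4. If a repeated attribute should only count once, a set intersection is one option. A small sketch on plain lists copied from the example data (the variable names here are only illustrative):

```python
# Attribute lists copied from the example data above.
john = ['ABC', 'DEF', 'GHI', 'JKL', 'MNO', 'PQR', 'STU']
username_3 = ['DSB', 'AOD', 'DEF', 'MNO', 'DEF', 'ABC', 'TAE']

# List comprehension: every occurrence in username_3 is counted,
# so the duplicated 'DEF' contributes twice.
count_with_repeats = len([w for w in username_3 if w in john])

# Set intersection: each distinct attribute is counted once.
shared = set(username_3) & set(john)

print(count_with_repeats)  # 4
print(sorted(shared))      # ['ABC', 'DEF', 'MNO']
```

Which of the two counts is "right" depends on whether duplicates in an attribute list are meaningful for your data.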

Note that this method could be optimized in how it loops over each row in the two data frames. The thread below describes a few different ways to process rows in data frames:

How to iterate over rows in a DataFrame in Pandas
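
As one possible way to avoid iterrows here, the two columns can be walked with zip and the lists converted to sets once up front. The sketch below is not the code from the answer above: best_matches is a hypothetical helper name, it uses set semantics (a repeated attribute counts once), and it also returns the size of each user's attribute set so the result can be reported in the "5 out of 6" form the question asks for.

```python
import pandas as pd

# Same example data as above, rebuilt so the snippet runs on its own.
df1 = pd.DataFrame({
    'name': ['John', 'Mike', 'Jane', 'Kevin', 'Stefanie'],
    'attributes': [
        ['ABC', 'DEF', 'GHI', 'JKL', 'MNO', 'PQR', 'STU'],
        ['EUD', 'DBS', 'QMD', 'ABC', 'GHI'],
        ['JKL', 'EJD', 'MDE', 'MNO', 'DEF', 'ABC'],
        ['FHE', 'EUD', 'GHI', 'MNO', 'ABC', 'AUE', 'HSG', 'PEO'],
        ['STU', 'EJD', 'DUE'],
    ],
})
df2 = pd.DataFrame({
    'username': ['username_1', 'username_2', 'username_3', 'username_4', 'username_5'],
    'attr': [
        ['DHD', 'EOA', 'AUE', 'CHE', 'ABC', 'PQR', 'QJF'],
        ['ABC', 'EKR', 'ADT', 'GHI', 'JKL', 'EJD', 'MNO', 'MDE'],
        ['DSB', 'AOD', 'DEF', 'MNO', 'DEF', 'ABC', 'TAE'],
        ['DJH', 'EUD', 'GHI', 'MNO', 'ABC', 'FHE'],
        ['CHQ', 'ELT', 'ABC', 'DEF', 'GHI'],
    ],
})


def best_matches(users, people):
    """Hypothetical helper: map each username to (name, shared, total).

    shared is the number of distinct attributes in common with the
    best-matching name; total is the user's number of distinct attributes.
    """
    # Convert every attribute list in the first frame to a set only once.
    people_sets = [(n, set(a)) for n, a in zip(people['name'], people['attributes'])]
    result = {}
    for username, attr in zip(users['username'], users['attr']):
        attr_set = set(attr)
        # max() returns the first maximal pair, so the earliest name wins on ties.
        name, shared = max(
            ((n, len(attr_set & s)) for n, s in people_sets),
            key=lambda pair: pair[1],
        )
        result[username] = (name, shared, len(attr_set))
    return result


matches = best_matches(df2, df1)
for username, (name, shared, total) in matches.items():
    print(f"{username} has {shared} out of {total} attributes matching those of {name}")
```

For frames of this size the difference is negligible; the set conversion mainly pays off when the attribute lists or the frames get large.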
