Perform matching in between Two lists in Python

Question

I have two Tables. I want to compare two columns and want to get matches row counts and row numbers. How can I get the expected result using Python. df1:

Name	Score	Year
Pat	82	1990
Chris	38	1993
Pat	92	1994
Noris	88	1997
Mit	62	1999
Chen	58	1996

df2:

Applicant
Pat
Chris
Meet

Expected result

Applicant	Match (Y/N)	Matched Row reference	Count
Pat	Y	1,3	2
Chris	Y	2	1
Meet	N	NA	0

I used pd.merge and was able to got the columns from df1 where it was matching. However I am not able to get the expected result with matched raw reference, match count and NA for non matching values — Tushar Patel, Sep 06 '22 at 21:53

DarrylG · Answer 1 · 2022-09-07T00:58:59.490

Approach based on Pandas outer merge

Outer merge of df1 & df2 creates all required rows
Group by Applicant so we can aggregate count
Use a function to aggregate and produce values for desired output rows
To keep index of df1 after merge, use method from to keep index after merge

Code

import pandas as pd
import numpy as np

def process(df1, df2):
    ' Overall function for generating desired output '
    
    def create_result(df, columns = ["Match (Y/N)", "Matched Row reference", "Count"]):
        '''
            Creates the desired columns of df2

            Input:
                df      - Dataframe from groupby
                columns - column names for df2
            Output:
                Pandas Series corresponding to row in df2
        '''
        cnt = df['Name'].count()   # Number of items in group
        if cnt > 0:
            # Convert index to comma delimited list, numbered from 1 (i.e. int(x) + 1)
            indexes = ','.join(str(int(x) + 1) for x in df.index.to_list())
        else:
            indexes = "NA"   # empty dataframe

        lst = ["Y" if cnt > 0 else 'N', 
                indexes,
                df.shape[0] if cnt > 0 else 0]

        return pd.Series(lst, index = columns)

    # Merge df1 with df2 but
    # add method from [to keep index after merge](https://stackoverflow.com/questions/11976503/how-to-keep-index-when-using-pandas-merge/11982843#11982843)
    # to have the index of df1 in the merge result
    return (df1
            .reset_index()
            .merge(df2, left_on = "Name", right_on = 'Applicant', how = "outer")
            .set_index('index')
            .groupby(['Applicant'])
            .apply(lambda grp_df: create_result(grp_df)))

Usage

from io import StringIO

s = '''Name Score   Year
Pat 82  1990
Chris   38  1993
Pat 92  1994
Noris   88  1997
Mit 62  1999
Chen    58  1996'''

df1 = pd.read_csv(StringIO(s), sep = '\t', engine = 'python')

s = '''Applicant
Pat
Chris
Meet'''

df2 = pd.read_csv(StringIO(s), sep = '\t', engine = 'python')

from pprint import pprint as pp
pp(process(df1, df2))            # process and pretty print result

Output

                   Match (Y/N) Matched Row reference  Count
Applicant                                         
Chris               Y                     2           1
Meet                N                    NA           0
Pat                 Y                   1,3           2

score 0 · Answer 2 · answered Sep 06 '22 at 23:50

I would use numpy and pandas for this. Because I belive that Pandas is the great libraries for dealing with huge data. Although you do not great number of data, I would still recommend you to use pandas.

For information about pandas https://pandas.pydata.org/

You are able to create list file with pandas

data = {'Name': ListForName, 
    'Score': ListForScore, 
    'Year': ListForScore}

For more information about creating a list. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html

I would use a basic for loop for matching. For example.

match = 0
for i in range(0, FirstList):
    for j in range(0, SecondList):
        if(FirstList['Colunm'].iloc[i] == SecondList['Colunm'].iloc[j)):
             match += 1

Perform matching in between Two lists in Python

2 Answers2