1

I want to compare the col1 and the col2 which one is matched in partial. I wrote the code like below, but the first row show 'Misamtch' not 'Match'

import pandas as pd

data_in = {'col1': ['BANQ1049576495', 'HLCUSEL221162979', 'SEL1469779'],
           'col2': ['KNKX1049576495', 'SEL221162979', 'KROL1020107403']}
df_input = pd.DataFrame(data_in)

data_out = {'col1': ['BANQ1049576495','HLCUSEL221162979','SEL1469779'],
            'col2': ['KNKX1049576495','SEL221162979','KROL1020107403'],
            'col3':['Match','Match','Mismatch']}
df_ouput = pd.DataFrame(data_out)


def compare_func(row):
pattern = re.compile('.*' + re.escape(row['col1']) + '.*')
if re.search(pattern, row['col2']):
return 'Match'
else:
return 'Mismatch'  

data_in['col3'] = data_in.apply(compare_func, axis=1)
data_ouput = data_in

Can you Please fit the code so that even if the values in two columns are partially matched, it will be checked as a match?

Ynjxsjmh
  • 28,441
  • 6
  • 34
  • 52
Dew
  • 13
  • 2
  • Can you clarify what constitutes a "match"? Is it just the letters (i.e., ignoring numbers)? Or numbers (ignoring letters)? Or something else – DarkKnight May 15 '23 at 04:29

4 Answers4

2

If you want to compare values row by row, you can use str.extract:

c1 = df_input['col1'].str.extract(r'(\d+)$', expand=False)
c2 = df_input['col2'].str.extract(r'(\d+)$', expand=False)
df_input['col3'] = np.where(c1 == c2, 'Match', 'Mismatch')

Output:

>>> df_input
               col1            col2      col3
0    BANQ1049576495  KNKX1049576495     Match
1  HLCUSEL221162979    SEL221162979     Match
2        SEL1469779  KROL1020107403  Mismatch
Corralien
  • 109,409
  • 8
  • 28
  • 52
1

IIUC, you can use :

#I'm using `df` instead of `df_input`
#stackoverflow.com/a/71899589/15239951 
#stackoverflow.com/a/69169135/15239951

#pip install thefuzz
#pip install python-Levenshtein
from thefuzz import fuzz

R = 70 # feel free to adjust the ratio

df["col3 (fw)"] = ["Match" if fuzz.ratio(c1, c2) >= R else "Mismatch"
                   for (c1, c2) in zip(df["col1"], df["col2"])]

Or, following your regex approach, match on the sequence of numbers :

def match_str(s, regex=r"(?<=[A-Z])[0-9]+"):
    return re.search(regex, s).group() if re.search(regex, s) else None

df["col3 (re)"] = ["Match" if match_str(s1) == match_str(s2) else "Mismatch"
                   for s1, s2 in zip(df["col1"], df["col2"])]

Output :

print(df)

               col1            col2 col3 (fw) col3 (re)
0    BANQ1049576495  KNKX1049576495     Match     Match
1  HLCUSEL221162979    SEL221162979     Match     Match
2        SEL1469779  KROL1020107403  Mismatch  Mismatch
Timeless
  • 22,580
  • 4
  • 12
  • 30
  • Don't use `fuzzywuzzy` package but `fuzz`: https://stackoverflow.com/a/71899589/15239951 – Corralien May 15 '23 at 04:46
  • It's like comparing 2 dataframes: https://stackoverflow.com/a/69169135/15239951 :) – Corralien May 15 '23 at 04:48
  • Oops, thanks Corralien ;) I made an update but not sure if it's the right way! Also, I don't think we need a fuzzy match here because it seems that the OP just wants to match the sequence of numbers and if it's the case, your answer is by far the best for this usecase. – Timeless May 15 '23 at 04:55
  • 1
    Important note: `pip install python-Levenshtein` (a C extension) – Corralien May 15 '23 at 04:57
0

It is not clear in your question how "partially match" is defined. From your provided examples, it seems that two strings are "partially matched", if they have the same suffix, and the suffix length no less than the length of numbers in the string. I will give my answer based on this observation.

You can use the following code:

import re

# str1 and str2 are the two strings
def is_partially_match(str1, str2):
    letters1, nums1, _ = re.split('(\d+)', str1)
    letters2, nums2, _ = re.split('(\d+)', str2)
    return 'Match' if nums1 == nums2 else 'Mismatch'
Terence
  • 1
  • 1
  • I don't know how the data will come in, so it's hard to set criteria. If I'm comparing values in two columns and matching them if they meet at least 5 characters, how do I write the code? – Dew May 15 '23 at 04:56
0

It looks as though you're trying to match on the numeric part in which case:

import re

data_in = {
    'col1': ['BANQ1049576495', 'HLCUSEL221162979', 'SEL1469779'],
    'col2': ['KNKX1049576495', 'SEL221162979', 'KROL1020107403']
}

FIND = re.compile(r'\d+')

for a, b in zip(data_in['col1'], data_in['col2']):
    m = 'Match' if FIND.findall(a) == FIND.findall(b) else 'Mismatch'
    data_in.setdefault('col3', []).append(m)

print(data_in)

Output:

{'col1': ['BANQ1049576495', 'HLCUSEL221162979', 'SEL1469779'], 'col2': ['KNKX1049576495', 'SEL221162979', 'KROL1020107403'], 'col3': ['Match', 'Match', 'Mismatch']}
DarkKnight
  • 19,739
  • 3
  • 6
  • 22