I have a dataframe in which I am trying to compare the string values of two columns and create a new column that is True if the two values match and False if they don't. I want to use match and a regex: remove all non-alphanumeric characters and lowercase both names before comparing them.

pattern = re.compile('[^a-zA-Z]')

    Name A         Name B
0   yGZ,)          ygz.
1   (CGI)          C.G.I
2   Exto           exto.
3   Golden         UTF

I was thinking of trying something like this:

dataframe['Name A', 'Name B'].str.match(pattern, flags= re.IGNORECASE)

    Name A         Name B    Result
0   yGZ,)          ygz.       True
1   (CGI)          C.G.I      True
2   Exto           exto.      True
3   Golden         UTF        False
TH14

2 Answers

You can use pd.DataFrame.replace to clean your strings and then compare them using eq. If you wish to keep your original df, assign the returned data frame to a new variable instead.

df = df.replace("[^a-zA-Z0-9]", '', regex=True)

Then

df['Result'] = df['Name A'].str.lower().eq(df['Name B'].str.lower())

Outputs

    Name A  Name B  Result
0   yGZ     ygz     True
1   CGI     CGI     True
2   Exto    exto    True
3   Golden  UTF     False
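
The "keep the original df" variant mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the sample frame is rebuilt from the question, and the name `cleaned` is hypothetical rather than something from the answer.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "Name A": ["yGZ,)", "(CGI)", "Exto", "Golden"],
    "Name B": ["ygz.", "C.G.I", "exto.", "UTF"],
})

# Clean a separate frame holding only the two name columns;
# df itself keeps the original, uncleaned strings.
cleaned = df[["Name A", "Name B"]].replace("[^a-zA-Z0-9]", "", regex=True)

# Compare case-insensitively and attach the result to the original frame.
df["Result"] = cleaned["Name A"].str.lower().eq(cleaned["Name B"].str.lower())
print(df)
```

With this layout the Result column sits next to the untouched names, which matches the output shown in the question.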
rafaelc
  • For foreign characters, the cleaned version of the column shows up as blank. Is there any fix for that? – TH14 Apr 09 '19 at 05:16
  • @TH14 you just slice the columns first to apply the replace in the data frame of interest... for example, `df[cols].replace(...)` where `cols=['Name A', 'Name B']` for instance. – rafaelc Apr 09 '19 at 23:14
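
On the foreign-character question: one likely cause is that `[^a-zA-Z0-9]` treats every non-ASCII letter as something to strip, so a name written entirely in another script ends up empty. A possible workaround, sketched here as an assumption and not as something either answer proposes, is to strip `[\W_]` using Python's `re` module, where `\w` is Unicode-aware for str patterns. The helper `normalize` and the sample names are hypothetical.

```python
import re

import pandas as pd

# [\W_] matches anything that is not a Unicode letter or digit
# (the underscore is listed explicitly because \w includes it).
NON_ALNUM = re.compile(r"[\W_]+")


def normalize(value: str) -> str:
    """Drop punctuation and whitespace, keep letters/digits of any script, lowercase."""
    return NON_ALNUM.sub("", value).lower()


# Hypothetical sample data; assumes every cell is a string (no NaN).
names = pd.DataFrame({
    "Name A": ["Société!", "(Москва)"],
    "Name B": ["societe", "москва."],
})

cols = ["Name A", "Name B"]
cleaned = names[cols].apply(lambda s: s.map(normalize))
names["Result"] = cleaned["Name A"].eq(cleaned["Name B"])
print(names)
```

Note that accented letters are kept but not folded, so "Société" and "societe" still compare as different; only punctuation and case are normalized.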
You can use str.replace to remove punctuation (also see another post of mine, Fast punctuation removal with pandas), lowercase the result, and then compare the two cleaned columns:

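# regex=True is needed on pandas >= 2.0, where str.replace treats the pattern literally by default
u = df.apply(lambda x: x.str.replace(r'[^\w]', '', regex=True).str.lower())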
df['Result'] = u['Name A'] == u['Name B']
df

   Name A Name B  Result
0   yGZ,)   ygz.    True
1   (CGI)  C.G.I    True
2    Exto  exto.    True
3  Golden    UTF   False
cs95
  • I'm getting this error: AttributeError: ('Can only use .str accessor with string values, which use np.object_ dtype in pandas', 'occurred at index Number'). Sorry, I should've mentioned that there are other columns which have numeric values. – TH14 Apr 09 '19 at 04:45
  • @TH14 Change the first line to: `u = df[['Name A', 'Name B']].apply(lambda x: x.str.replace(r'[^\w]', '', regex=True).str.lower())` and it should work. – cs95 Apr 09 '19 at 04:50
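
To make the fix from the comment above concrete, here is a small sketch on a frame that also has a numeric column. The column name `Number` is taken from the error message; its values and the variable `name_cols` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name A": ["yGZ,)", "(CGI)", "Exto", "Golden"],
    "Name B": ["ygz.", "C.G.I", "exto.", "UTF"],
    "Number": [1, 2, 3, 4],  # numeric column: the .str accessor would fail on it
})

# Apply the string cleaning only to the name columns.
name_cols = ["Name A", "Name B"]
u = df[name_cols].apply(
    lambda x: x.str.replace(r"[^\w]", "", regex=True).str.lower()
)

df["Result"] = u["Name A"] == u["Name B"]
print(df)
```

Selecting the name columns first means the numeric column is never passed through `.str`, so the AttributeError does not arise and `Number` is left unchanged.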