2

Below is a subset of a pandas data frame that I have

           index             name_matches dist_matches
38  PO1000000345                  M-00346      M-00346
39  PO1000000352                               M-00804
40  PO1000000354                  M-00196      M-00196
41  PO1000000355                  M-00514      M-00514
42  PO1000000382          M-00353,M-00354      M-00354
43  PO1000000411                                      
44  PO1000000451                                      
45  PO1000000512                               M-00680
46  PO1000000530                  M-00089             
47  PO1000000531                  M-00087      M-00087
48  PO1000000553  M-00917,M-00920,M-00922      M-00920

I am trying to get a new column (comb_matches) which pulls out the matching value in the name_matches and dist_matches columns. Occasionally, there will be one or more values in the column separated by commas. An example of the output I am looking to get is shown below.

           index             name_matches dist_matches  comb_matches
38  PO1000000345                  M-00346      M-00346       M-00346
39  PO1000000352                               M-00804
40  PO1000000354                  M-00196      M-00196       M-00196
41  PO1000000355                  M-00514      M-00514       M-00514
42  PO1000000382          M-00353,M-00354      M-00354       M-00354
43  PO1000000411                                      
44  PO1000000451                                      
45  PO1000000512                               M-00680
46  PO1000000530                  M-00089             
47  PO1000000531                  M-00087      M-00087       M-00087
48  PO1000000553  M-00917,M-00920,M-00922      M-00920       M-00920

Is there any easy way to get the above?

Funkeh-Monkeh
  • 649
  • 6
  • 17

2 Answers2

5

No easy way. Pandas isn't designed for this kind of task, it's not vectorisable. Your best option may be a list comprehension:

s1 = df['dist_matches'].astype(str)
s2 = df['name_matches'].astype(str).str.split(',')
mask = [i in j for i, j in zip(s1, s2)]

df['comb_match'] = np.where(mask, df['dist_matches'], np.nan)

Performance benchmarking

To demonstrate the fact Pandas str methods aren't truly vectorised:

# Python 3.6.5, Pandas 0.23.0

def wen(df):
    Bool = df.name_matches.str.split(',',expand=True).isin(df.dist_matches).any(1)    
    df['comb_match'] = np.where(Bool, df.dist_matches, '')
    return df

def jpp(df):
    s1 = df['dist_matches'].astype(str)
    s2 = df['name_matches'].astype(str).str.split(',')
    mask = [i in j for i, j in zip(s1, s2)]
    df['comb_match'] = np.where(mask, df['dist_matches'], np.nan)
    return df

df = pd.concat([df]*1000, ignore_index=True)

assert jpp(df).equals(wen(df))

%timeit jpp(df)  # 12.2 ms
%timeit wen(df)  # 32.7 ms
jpp
  • 159,742
  • 34
  • 281
  • 339
4

Using str.split before isin . then we implement the Boolean to np.where

Bool=df.name_matches.str.split(',',expand=True).isin(df.dist_matches).any(1)    
df['comb_match']=np.where(Bool,df.dist_matches,'')
df
Out[520]: 
           index             name_matches dist_matches comb_match
38  PO1000000345                  M-00346      M-00346    M-00346
39  PO1000000352                               M-00804           
40  PO1000000354                  M-00196      M-00196    M-00196
41  PO1000000355                  M-00514      M-00514    M-00514
42  PO1000000382          M-00353,M-00354      M-00354    M-00354
43  PO1000000411                                                 
44  PO1000000451                                                 
45  PO1000000512                               M-00680           
46  PO1000000530                  M-00089                        
47  PO1000000531                  M-00087      M-00087    M-00087
48  PO1000000553  M-00917,M-00920,M-00922      M-00920    M-00920
BENY
  • 317,841
  • 20
  • 164
  • 234
  • Hey wen, mind explaining the `.any` method? Very cool and wen I learn a lot from you. – Umar.H Oct 29 '18 at 18:55
  • 1
    @Datanovice any will return True if any of the value is True , only return False when there are all False . here i means axis=1 , which check row by row . – BENY Oct 29 '18 at 18:57
  • Thanks for this @Wen, this works great. Unfortunately, I just came across another situation where there are also values separated by columns in the `dist_matches`. This code would need to be modified slightly for that right? – Funkeh-Monkeh Oct 29 '18 at 19:06
  • 1
    @Funkeh-Monkeh yes , then you need to look into to set intersection – BENY Oct 29 '18 at 19:07