
I have 2 dataframes that I would like to merge on a common column. However, the columns I would like to merge on do not hold identical strings; rather, a string from one is contained in the other, like so:

import pandas as pd
df1 = pd.DataFrame({'column_a':['John','Michael','Dan','George', 'Adam'], 'column_common':['code','other','ome','no match','word']})

df2 = pd.DataFrame({'column_b':['Smith','Cohen','Moore','K', 'Faber'], 'column_common':['some string','other string','some code','this code','word']})

The outcome I would like from df1.merge(df2, ...) is the following:

column_a  |  column_b
----------------------
John      |  Moore    <- merged on 'code' contained in 'some code' 
Michael   |  Cohen    <- merged on 'other' contained in 'other string'  
Dan       |  Smith    <- merged on 'ome' contained in 'some string'  
George    |  n/a
Adam      |  Faber    <- merged on 'word' contained in 'word'  
callmeGuy

3 Answers


New Answer

Here is one approach based on pandas/numpy.

# For each string in df1.column_common, grab the column_b values of the
# df2 rows whose column_common contains it. apply() returns a DataFrame
# with one column per df2 index, so bfill(axis=1) + iloc[:, 0] picks the
# first match in each row (NaN when there is none).
rhs = (df1.column_common
          .apply(lambda x: df2[df2.column_common.str.find(x).ge(0)]['column_b'])
          .bfill(axis=1)
          .iloc[:, 0])

# Stitch the matches back onto column_a and restore the column names.
(pd.concat([df1.column_a, rhs], axis=1, ignore_index=True)
 .rename(columns={0: 'column_a', 1: 'column_b'}))

  column_a column_b
0     John    Moore
1  Michael    Cohen
2      Dan    Smith
3   George      NaN
4     Adam    Faber
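
As MMelnicki suggests in the comments below, `str.contains` reads better than `str.find(x).ge(0)`. A minimal variant of the same approach; note that `regex=False` is my assumption, so each pattern is matched as a literal substring rather than as a regex:

# Same approach with str.contains; regex=False treats the pattern
# as a plain substring (an assumption for literal matching).
rhs = (df1.column_common
          .apply(lambda x: df2.loc[df2.column_common.str.contains(x, regex=False),
                                   'column_b'])
          .bfill(axis=1)
          .iloc[:, 0])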

Old Answer

Here's a solution with inner-join behaviour: it doesn't keep column_a values that match no column_b value. It is slower than the pandas/numpy solution above because it builds a Python list with two nested iterrows loops.

# Nested loops over both frames: keep a (column_a, column_b) pair whenever
# df1's column_common string is a substring of df2's.
tups = [(a1, a2) for i, (a1, b1) in df1.iterrows()
                 for j, (a2, b2) in df2.iterrows()
        if b1 in b2]

# Keep only the first match for each column_a value.
(pd.DataFrame(tups, columns=['column_a', 'column_b'])
   .drop_duplicates('column_a')
   .reset_index(drop=True))

  column_a column_b
0     John    Moore
1  Michael    Cohen
2      Dan    Smith
3     Adam    Faber
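
If left-join behaviour is wanted from this approach (keeping George with a missing value), one option is to merge the matched pairs back onto df1. This is a sketch, not part of the original answer:

# Merge the matched pairs back onto df1 so unmatched names are kept as NaN.
pairs = (pd.DataFrame(tups, columns=['column_a', 'column_b'])
           .drop_duplicates('column_a'))
df1[['column_a']].merge(pairs, on='column_a', how='left')
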
Peter Leimbigler
  • Hm. But it's a pure Python solution. Could it be done with `pandas` or `numpy`? E.g. with a combination of `apply`, `map`, `join`, `merge`, etc.? – Alex Yu Feb 19 '19 at 14:17
  • @Peter Leimbigler, well, that's one way to do it. I am however looking for something similar to what Alex Yu is suggesting – callmeGuy Feb 19 '19 at 14:22
  • This certainly could be improved! It's the first instinctive solution that came to mind. Does it run in an acceptable time on your amount of data? If so, why bother optimizing? ;) – Peter Leimbigler Feb 19 '19 at 14:39
  • @callmeGuy Maybe it would be better to clarify in the question that a "pandas/numpy solution" is requested. Otherwise, this solution is formally correct – Alex Yu Feb 19 '19 at 16:29
  • Gotcha. My old answer performed quite poorly in my limited testing. I've edited in a newer, slightly faster solution - though it's less readable. – Peter Leimbigler Feb 19 '19 at 19:07
  • Ah-ha! That's exactly what I myself tried to find! Thank you! – Alex Yu Feb 19 '19 at 20:01
  • The code would be more user-friendly (as well as faster) if you use `str.contains(x)` instead of `str.find(x).ge(0)`! (I had to look up `pd.Series.ge()` -- FYI, it evaluates greater-than-or-equal-to, elementwise.) – MMelnicki Aug 06 '20 at 14:55
  • I think this won't work if the join could create multiple rows – Eduardo EPF Oct 16 '21 at 19:16

My solution involves applying a function to the common column. I can't imagine it holds up well when df2 is large, but perhaps someone more knowledgeable than I can suggest an improvement.

def strmerge(strcolumn):
    # Return the column_b value of the first df2 row whose column_common
    # contains strcolumn; implicitly returns None when nothing matches.
    for i in df2['column_common']:
        if strcolumn in i:
            return df2[df2['column_common'] == i]['column_b'].values[0]

df1['column_b'] = df1['column_common'].apply(strmerge)

df1
    column_a    column_common   column_b
0   John        code            Moore
1   Michael     other           Cohen
2   Dan         ome             Smith
3   George      no match        None
4   Adam        word            Faber
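
One possible small improvement along these lines (a sketch, not the author's code; `pairs` and `strmerge_fast` are hypothetical names): iterate over (column_common, column_b) pairs directly, so the matched row does not have to be re-located with a second boolean filter.

# Precompute the (column_common, column_b) pairs once.
pairs = list(zip(df2['column_common'], df2['column_b']))

def strmerge_fast(strcolumn):
    # Return the first column_b whose column_common contains strcolumn,
    # or None when nothing matches.
    for common, b in pairs:
        if strcolumn in common:
            return b

df1['column_b'] = df1['column_common'].apply(strmerge_fast)
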
Chris Decker

A simple, readable approach is to do a cross join and then keep the rows where the column_common value of one frame is a substring of the other's:

df = df1.merge(df2, how='cross')  # cross join requires pandas >= 1.2
df.loc[df.column_common_x.eq('no match'), 'column_b'] = pd.NA
# Keep rows where df1's string is contained in df2's, plus the special-cased
# 'no match' rows, then take the first hit per column_a value.
df.loc[df.apply(lambda r: r.column_common_x in r.column_common_y
                or r.column_common_x == 'no match', axis=1),
       ['column_a', 'column_b']].drop_duplicates(subset=['column_a'])

Output:

column_a  column_b
John      Moore
Michael   Cohen
Dan       Smith
George
Adam      Faber
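
Note that the 'no match' special case above hardcodes a value from the sample data. A sketch that avoids it (my assumption, not the author's code) filters on containment only and left-merges back onto df1, so unmatched names reappear with NaN:

# Cross join, keep containment matches, then left-merge to restore
# unmatched column_a values as NaN.
cross = df1.merge(df2, how='cross')
matched = (cross.loc[cross.apply(lambda r: r.column_common_x in r.column_common_y,
                                 axis=1), ['column_a', 'column_b']]
                .drop_duplicates('column_a'))
df1[['column_a']].merge(matched, on='column_a', how='left')
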
Hamza