2

I have two data source:

df1 = pd.DataFrame({'a': [1, 2, 3, 4]})
df2 = pd.DataFrame({'b': [
    'Some text 11.',
    'Good 2 number',
    'Other 33 not valid',
    '4 is good too even with 7'
]})

df1 look like

   a
0  1
1  2
2  3
3  4

and the df2 looks like

                           b
0              Some text 11.
1              Good 2 number
2         Other 33 not valid
3  4 is good too even with 7

My goal is merge them if number from the column df1.a exists in any position in the column df2.b, but only exact number. so the result should be:

a                         b
2              Good 2 number
4  4 is good too even with 7

I can modify the solution on string contains but looks very complicated. In the real data the both source has more than 20k records.

Brown Bear
  • 19,655
  • 10
  • 58
  • 76

1 Answers1

1

You can get all numeric by Series.str.extractall to DataFrame with Series.unstack, convert to integers and match by DataFrame.isin with test if at least one True per rows by DataFrame.any:

mask = (df2['b'].str.extractall('(\d+)')[0]
                .astype(int)
                .unstack()
                .reindex(df2.index)
                .isin(df1['a'])
                .any(axis=1))
df = df2[mask]
print (df)
                           b
1              Good 2 number
3  4 is good too even with 7

You can join all values of df1['a'] by | for regex or with \b\b for words boundaries and pass to Series.str.contains:

pat = '|'.join(r"\b{}\b".format(x) for x in df1['a'])
df = df2[df2['b'].str.contains(pat)]
print (df)
                           b
1              Good 2 number
3  4 is good too even with 7

If need solution with merge:

s = df2['b'].str.extractall('(\d+)')[0].astype(int).reset_index(level=1, drop=True)

df = df1.merge(df2.join(s.rename('a')),on='a')
print (df)
   a                          b
0  2              Good 2 number
1  4  4 is good too even with 7
    
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Thank you for your answer.I'm sorry to miss information that both source have more than 20k records on the test environment and possible more in the prod – Brown Bear Nov 03 '20 at 08:49
  • @BrownBear - tehn second should be better, I guess – jezrael Nov 03 '20 at 08:50
  • 1
    the second solution is that I was searching for, thank you once again. May be it can be put to the top of the answer ) – Brown Bear Nov 03 '20 at 09:07