Perform a merge based on a substring match?

Question

While this question is closely related to Fuzzy Match Merge with Pandas, this question is specifically about merging only (or subsetting, in this case) when the key in one DataFrame is a full match, or substring of the key in another DataFrame. To illustrate my point, here are 2 DataFrames:

df1
   id   code
0   1   E282
1   2  O0080
2   3    R52
3   4  J0100
4   5    F99

df2
    code  val
0   V282   11
1   O008   12
2  J0101   13
3    F99   14
4    R55   15

The problem with using difflib is that I really don't want to match the closest string, and I'm not sure I'd be able to separate matches like V282 to E282, which shouldn't happen and a match like O008 to O0080 which should merge.

The expected output should be

   code1  id
0  O0080   2
1    F99   5

I can get to this result with

import numpy as np
df1[np.logical_or.reduce([df1['code'].str.contains(code) for code in df2.code.tolist()])]

but since df1 is 42M rows long and df2 contains ~4000 codes, this method is unbelievably slow. Is this the best I'm going to do? It just seems unfortunate, when inner merging a 21M row df and a 7M row df on exact keys takes < 1 minute.

Will the `df1` code always start with the `df2` code - The extra characters always at the end? — wwii, May 26 '18 at 00:53
@coldspeed yeah it definitely sped it up. It only takes like 15% the time now, which is manageable. — ALollz, May 26 '18 at 16:43

score 2 · Accepted Answer · answered May 25 '18 at 23:53

2

This is a difficult problem. Maybe consider a python approach? any will shortcircuit here, so you should save on some cycles. Also, contains does not necessarily check from the beginning, so using startswith instead should be more efficient.

df1[
    any(
         i.startswith(j) for j in df2.codes.tolist()
    ) for i in df1.codes.tolist()
]

answered May 25 '18 at 23:53

cs95

379,657
97
704
746

If you run into memory issues, remove the `.tolist` calls, but that'll result in some performance drop. – cs95 May 25 '18 at 23:56

score 2 · Answer 2 · answered Jul 14 '20 at 12:22

What about:

import pandas as pd

df1 = pd.DataFrame({'id':[1,2,3,4,5], 'code':['E282', 'O0080', 'R52', 'J0100', 'F99']})
df2 = pd.DataFrame({'code':['V282','O008','J0101','F99','R55'], 'val':[11,12, 13, 14, 15]})

pat = "|".join(df2['code'])
df1.insert(0, 'part_code', df1['code'].str.extract("(" + pat + ')', expand=False))
pd.merge(df1, df2, how='inner', left_on='part_code', right_on='code')[['code_y', 'id']]

Inspired by https://stackoverflow.com/a/48744202/12256369

Perform a merge based on a substring match?

2 Answers2