While this question is closely related to Fuzzy Match Merge with Pandas, this question is specifically about merging only (or subsetting, in this case) when the key in one DataFrame
is a full match, or substring of the key in another DataFrame
. To illustrate my point, here are 2 DataFrames
:
df1
id code
0 1 E282
1 2 O0080
2 3 R52
3 4 J0100
4 5 F99
df2
code val
0 V282 11
1 O008 12
2 J0101 13
3 F99 14
4 R55 15
The problem with using difflib
is that I really don't want to match the closest string, and I'm not sure I'd be able to separate matches like V282
to E282
, which shouldn't happen and a match like O008
to O0080
which should merge.
The expected output should be
code1 id
0 O0080 2
1 F99 5
I can get to this result with
import numpy as np
df1[np.logical_or.reduce([df1['code'].str.contains(code) for code in df2.code.tolist()])]
but since df1
is 42M rows long and df2
contains ~4000 codes, this method is unbelievably slow. Is this the best I'm going to do? It just seems unfortunate, when inner merging a 21M row df and a 7M row df on exact keys takes < 1 minute.