1

While this question is closely related to Fuzzy Match Merge with Pandas, this question is specifically about merging only (or subsetting, in this case) when the key in one DataFrame is a full match, or substring of the key in another DataFrame. To illustrate my point, here are 2 DataFrames:

df1
   id   code
0   1   E282
1   2  O0080
2   3    R52
3   4  J0100
4   5    F99

df2
    code  val
0   V282   11
1   O008   12
2  J0101   13
3    F99   14
4    R55   15

The problem with using difflib is that I really don't want to match the closest string, and I'm not sure I'd be able to separate matches like V282 to E282, which shouldn't happen and a match like O008 to O0080 which should merge.

The expected output should be

   code1  id
0  O0080   2
1    F99   5

I can get to this result with

import numpy as np
df1[np.logical_or.reduce([df1['code'].str.contains(code) for code in df2.code.tolist()])]

but since df1 is 42M rows long and df2 contains ~4000 codes, this method is unbelievably slow. Is this the best I'm going to do? It just seems unfortunate, when inner merging a 21M row df and a 7M row df on exact keys takes < 1 minute.

cs95
  • 379,657
  • 97
  • 704
  • 746
ALollz
  • 57,915
  • 7
  • 66
  • 89

2 Answers2

2

This is a difficult problem. Maybe consider a python approach? any will shortcircuit here, so you should save on some cycles. Also, contains does not necessarily check from the beginning, so using startswith instead should be more efficient.

df1[
    any(
         i.startswith(j) for j in df2.codes.tolist()
    ) for i in df1.codes.tolist()
]
cs95
  • 379,657
  • 97
  • 704
  • 746
  • If you run into memory issues, remove the `.tolist` calls, but that'll result in some performance drop. – cs95 May 25 '18 at 23:56
2

What about:

import pandas as pd

df1 = pd.DataFrame({'id':[1,2,3,4,5], 'code':['E282', 'O0080', 'R52', 'J0100', 'F99']})
df2 = pd.DataFrame({'code':['V282','O008','J0101','F99','R55'], 'val':[11,12, 13, 14, 15]})

pat = "|".join(df2['code'])
df1.insert(0, 'part_code', df1['code'].str.extract("(" + pat + ')', expand=False))
pd.merge(df1, df2, how='inner', left_on='part_code', right_on='code')[['code_y', 'id']]

Inspired by https://stackoverflow.com/a/48744202/12256369

abogaard
  • 339
  • 3
  • 8