131

I have two DataFrames which I want to merge based on a column. However, due to alternate spellings, different numbers of spaces, and the absence/presence of diacritical marks, I would like to be able to merge as long as the values are similar to one another.

Any similarity algorithm will do (soundex, Levenshtein, difflib's).

Say one DataFrame has the following data:

df1 = DataFrame([[1],[2],[3],[4],[5]], index=['one','two','three','four','five'], columns=['number'])

       number
one         1
two         2
three       3
four        4
five        5

df2 = DataFrame([['a'],['b'],['c'],['d'],['e']], index=['one','too','three','fours','five'], columns=['letter'])

      letter
one        a
too        b
three      c
fours      d
five       e

Then I want to get the resulting DataFrame

       number letter
one         1      a
two         2      b
three       3      c
four        4      d
five        5      e
Andy Hayden
pocketfullofcheese
  • Yes with [d6tjoin](https://github.com/d6t/d6tjoin) see [MergeTop1 notebook](https://github.com/d6t/d6tjoin/blob/master/examples-top1.ipynb) – citynorman Jan 20 '19 at 05:38
  • The accepted solution fails in the cases where no close matches are found. For a simple way around, [see this alternative](https://stackoverflow.com/a/60908516/9698684) – yatu Sep 15 '20 at 06:59

16 Answers

108

Similar to @locojay suggestion, you can apply difflib's get_close_matches to df2's index and then apply a join:

In [23]: import difflib 

In [24]: difflib.get_close_matches
Out[24]: <function difflib.get_close_matches>

In [25]: df2.index = df2.index.map(lambda x: difflib.get_close_matches(x, df1.index)[0])

In [26]: df2
Out[26]: 
      letter
one        a
two        b
three      c
four       d
five       e

In [31]: df1.join(df2)
Out[31]: 
       number letter
one         1      a
two         2      b
three       3      c
four        4      d
five        5      e


If these were columns, in the same vein you could apply to the column then merge:

df1 = DataFrame([[1,'one'],[2,'two'],[3,'three'],[4,'four'],[5,'five']], columns=['number', 'name'])
df2 = DataFrame([['a','one'],['b','too'],['c','three'],['d','fours'],['e','five']], columns=['letter', 'name'])

df2['name'] = df2['name'].apply(lambda x: difflib.get_close_matches(x, df1['name'])[0])
df1.merge(df2)
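As noted in the comments, `get_close_matches(...)[0]` raises an `IndexError` when no candidate clears the cutoff. A hedged variant of the same index-mapping approach that guards against that (falling back to the original key is one choice; `None` or dropping the row are others):

```python
import difflib

import pandas as pd

df1 = pd.DataFrame([[1], [2], [3], [4], [5]],
                   index=['one', 'two', 'three', 'four', 'five'], columns=['number'])
df2 = pd.DataFrame([['a'], ['b'], ['c'], ['d'], ['e']],
                   index=['one', 'too', 'three', 'fours', 'five'], columns=['letter'])

def closest(x, candidates):
    # get_close_matches returns [] when nothing clears the cutoff,
    # so guard before indexing instead of taking [0] blindly
    matches = difflib.get_close_matches(x, candidates, n=1)
    return matches[0] if matches else x  # fall back to the original key

df2.index = df2.index.map(lambda x: closest(x, df1.index))
result = df1.join(df2)
```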
das-g
Andy Hayden
  • Does anyone know if there is a way to do this between rows of one column? I'm trying to find duplicates that might have typos – As3adTintin Oct 29 '15 at 20:17
  • You can use n=1 to limit the results to 1: https://docs.python.org/3/library/difflib.html#difflib.get_close_matches – Bastian Jul 13 '16 at 10:54
  • How to go about it if the two dataframes have different lengths? – famargar Jan 23 '18 at 12:28
  • This solution fails in many places and I prefer https://github.com/d6t/d6tjoin. You can customize the similarity function, affinegap is a better similarity metric, it's multi-core for faster compute, deals with duplicate matches, etc. – citynorman Jan 18 '19 at 19:07
  • This solution fails in the cases where no close matches were found. [Here's a simple way around](https://stackoverflow.com/a/60908516/9698684) – yatu May 05 '20 at 09:35
  • For those that say it fails, I think that is more of an issue of how to implement this into your pipeline, and not a fault of the solution, which is simple and elegant. – Maxim Oct 13 '20 at 18:26
  • I used a similar solution but used `[:1]` to resize the result list from `get_close_matches` and make sure it doesn't throw a `KeyError` – F. Leone Oct 23 '20 at 15:29
  • Is it possible to return matches _with the similarity score included_ with difflib? – alphabetasoup May 09 '22 at 21:24
73

Using fuzzywuzzy

Since there are no examples with the fuzzywuzzy package, here's a function I wrote that returns all matches above a threshold you set:


Example dataframe

df1 = pd.DataFrame({'Key':['Apple', 'Banana', 'Orange', 'Strawberry']})
df2 = pd.DataFrame({'Key':['Aple', 'Mango', 'Orag', 'Straw', 'Bannanna', 'Berry']})

# df1
          Key
0       Apple
1      Banana
2      Orange
3  Strawberry

# df2
        Key
0      Aple
1     Mango
2      Orag
3     Straw
4  Bannanna
5     Berry

Function for fuzzy matching

def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
    """
    :param df_1: the left table to join
    :param df_2: the right table to join
    :param key1: key column of the left table
    :param key2: key column of the right table
    :param threshold: how close the matches should be to return a match, based on Levenshtein distance
    :param limit: the amount of matches that will get returned, these are sorted high to low
    :return: dataframe with both keys and matches
    """
    s = df_2[key2].tolist()
    
    m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))    
    df_1['matches'] = m
    
    m2 = df_1['matches'].apply(lambda x: ', '.join([i[0] for i in x if i[1] >= threshold]))
    df_1['matches'] = m2
    
    return df_1

Using our function on the dataframes: #1

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

fuzzy_merge(df1, df2, 'Key', 'Key', threshold=80)

          Key       matches
0       Apple          Aple
1      Banana      Bannanna
2      Orange          Orag
3  Strawberry  Straw, Berry

Using our function on the dataframes: #2

df1 = pd.DataFrame({'Col1':['Microsoft', 'Google', 'Amazon', 'IBM']})
df2 = pd.DataFrame({'Col2':['Mcrsoft', 'gogle', 'Amason', 'BIM']})

fuzzy_merge(df1, df2, 'Col1', 'Col2', 80)

        Col1  matches
0  Microsoft  Mcrsoft
1     Google    gogle
2     Amazon   Amason
3        IBM         

Installation:

Pip

pip install fuzzywuzzy

Anaconda

conda install -c conda-forge fuzzywuzzy
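A recurring question in the comments below is how to carry the rest of df2's columns through to the output. One approach is to keep the best-matched string as a column and then merge back on it. The sketch below uses stdlib difflib as the scorer so it runs without fuzzywuzzy installed; `process.extractOne` would slot in the same way. The `Id` column is a made-up stand-in for whatever extra df2 columns you want to keep.

```python
import difflib

import pandas as pd

df1 = pd.DataFrame({'Key': ['Apple', 'Banana', 'Orange', 'Strawberry']})
df2 = pd.DataFrame({'Key': ['Aple', 'Mango', 'Orag', 'Straw', 'Bannanna', 'Berry'],
                    'Id': [1, 2, 3, 4, 5, 6]})  # extra column we want to keep

def best_match(x, candidates):
    # returns the single closest candidate, or None if nothing clears the cutoff
    m = difflib.get_close_matches(x, candidates, n=1, cutoff=0.6)
    return m[0] if m else None

# record which df2 key each df1 key matched...
df1['match'] = df1['Key'].map(lambda x: best_match(x, df2['Key']))

# ...then an ordinary merge pulls every df2 column through
out = df1.merge(df2, left_on='match', right_on='Key',
                how='left', suffixes=('', '_df2'))
```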
Erfan
  • is there a way to carry all of df2's columns over to the match? Let's say c is a primary or foreign key you'd like to keep of table 2 (df2) – Tinkinc Sep 04 '19 at 15:16
  • @Tinkinc did you figure out how to do it? – Fatima Jan 28 '20 at 06:41
  • hey Erfan, when you get a mo think you could update this to be used with pandas 1.0? I wonder what sort of performance boost it would get if you changed the engine in apply to Cython or Numba – Umar.H Jan 31 '20 at 16:06
  • This solution looks really promising for my problem as well. But could you explain how this will work when I do not have a common column in both datasets? How can I create a match column in one of the two datasets that gives me the score? I have used your #2 solution. I am not sure why it is taking so much time to run. – Django0602 Feb 11 '20 at 14:19
  • I have read the comments. I will soon update the code which helps the problem of @Tinkinc and try to implement apply with Cython or Numba to speed up code runtime. – Erfan Feb 11 '20 at 14:21
  • Sorry, I get `TypeError: expected string or bytes-like object` when I merge using your function, any ideas to solve this issue? – ah bon Mar 11 '20 at 10:07
  • If you need the matched keys too, you can use `s = df_2.to_dict()[key2]` – suricactus Mar 22 '20 at 15:16
  • https://github.com/seatgeek/fuzzywuzzy/issues/165 describes @suricactus's dict strategy a little more. From there you can merge on the index to get the rest of df2's columns. – snl Jun 17 '20 at 04:39
  • this works well, any way to get this working @Erfan for the whole sheet? So key + the rest of the data. – iHaag Feb 22 '21 at 11:20
  • @iHaag a bit late to the game but I have added the functionality you are looking for in my answer: https://stackoverflow.com/a/66833195/5125264 – Matt Mar 27 '21 at 15:49
  • I have asked a new question about @Tinkinc's comment: https://stackoverflow.com/questions/72348033/fuzzy-merge-with-python-pandas – Tom May 23 '22 at 11:48
24

I have written a Python package which aims to solve this problem:

pip install fuzzymatcher

You can find the repo here and docs here.

Basic usage:

Given two dataframes df_left and df_right, which you want to fuzzy join, you can write the following:

from fuzzymatcher import link_table, fuzzy_left_join

# Columns to match on from df_left
left_on = ["fname", "mname", "lname",  "dob"]

# Columns to match on from df_right
right_on = ["name", "middlename", "surname", "date"]

# The link table potentially contains several matches for each record
fuzzymatcher.link_table(df_left, df_right, left_on, right_on)

Or if you just want to link on the closest match:

fuzzymatcher.fuzzy_left_join(df_left, df_right, left_on, right_on)
Erfan
RobinL
15

I would use Jaro-Winkler, because it is one of the most performant and accurate approximate string matching algorithms currently available [Cohen, et al.], [Winkler].

This is how I would do it with Jaro-Winkler from the jellyfish package:

import jellyfish
import pandas

def get_closest_match(x, list_strings):
    best_match = None
    highest_jw = 0

    for current_string in list_strings:
        current_score = jellyfish.jaro_winkler(x, current_string)
        if current_score > highest_jw:
            highest_jw = current_score
            best_match = current_string

    return best_match

df1 = pandas.DataFrame([[1],[2],[3],[4],[5]], index=['one','two','three','four','five'], columns=['number'])
df2 = pandas.DataFrame([['a'],['b'],['c'],['d'],['e']], index=['one','too','three','fours','five'], columns=['letter'])

df2.index = df2.index.map(lambda x: get_closest_match(x, df1.index))

df1.join(df2)

Output:

    number  letter
one     1   a
two     2   b
three   3   c
four    4   d
five    5   e
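If installing jellyfish is not an option, Jaro-Winkler is compact enough to sketch in pure Python. This is a minimal, unoptimized version for illustration only; jellyfish's C implementation is much faster and is what the answer above assumes.

```python
def jaro_winkler(s1, s2, p=0.1):
    """Jaro-Winkler similarity in [0, 1] (pure Python, unoptimized)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if not len1 or not len2:
        return 0.0
    # characters match if equal and within half the longer length
    window = max(len1, len2) // 2 - 1
    match1 = [False] * len1
    match2 = [False] * len2
    m = 0
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len2, i + window + 1)
        for j in range(lo, hi):
            if not match2[j] and s2[j] == c:
                match1[i] = match2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # transpositions: matched characters that appear in a different order
    k = t = 0
    for i in range(len1):
        if match1[i]:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    jaro = (m / len1 + m / len2 + (m - t / 2) / m) / 3
    # Winkler boost for a shared prefix of up to 4 characters
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return jaro + prefix * p * (1 - jaro)
```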
lostsoul29
11

For a general approach: fuzzy_merge

For a more general scenario in which we want to merge columns from two dataframes which contain slightly different strings, the following function uses difflib.get_close_matches along with merge in order to mimic the functionality of pandas' merge but with fuzzy matching:

import difflib 

def fuzzy_merge(df1, df2, left_on, right_on, how='inner', cutoff=0.6):
    df_other= df2.copy()
    df_other[left_on] = [get_closest_match(x, df1[left_on], cutoff) 
                         for x in df_other[right_on]]
    return df1.merge(df_other, on=left_on, how=how)

def get_closest_match(x, other, cutoff):
    matches = difflib.get_close_matches(x, other, cutoff=cutoff)
    return matches[0] if matches else None

Here are some use cases with two sample dataframes:

print(df1)

     key   number
0    one       1
1    two       2
2  three       3
3   four       4
4   five       5

print(df2)

                 key_close  letter
0                    three      c
1                      one      a
2                      too      b
3                    fours      d
4  a very different string      e

With the above example, we'd get:

fuzzy_merge(df1, df2, left_on='key', right_on='key_close')

     key  number key_close letter
0    one       1       one      a
1    two       2       too      b
2  three       3     three      c
3   four       4     fours      d

And we could do a left join with:

fuzzy_merge(df1, df2, left_on='key', right_on='key_close', how='left')

     key  number key_close letter
0    one       1       one      a
1    two       2       too      b
2  three       3     three      c
3   four       4     fours      d
4   five       5       NaN    NaN

For a right join, all non-matching keys in the left dataframe are set to None:

fuzzy_merge(df1, df2, left_on='key', right_on='key_close', how='right')

     key  number                key_close letter
0    one     1.0                      one      a
1    two     2.0                      too      b
2  three     3.0                    three      c
3   four     4.0                    fours      d
4   None     NaN  a very different string      e

Also note that difflib.get_close_matches will return an empty list if no item is matched within the cutoff. In the shared example, if we change the last index in df2 to say:

print(df2)

                          letter
one                          a
too                          b
three                        c
fours                        d
a very different string      e

We'd get an index out of range error:

df2.index.map(lambda x: difflib.get_close_matches(x, df1.index)[0])

IndexError: list index out of range

In order to solve this the above function get_closest_match will return the closest match by indexing the list returned by difflib.get_close_matches only if it actually contains any matches.
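The guard matters because of how the cutoff behaves: difflib's default cutoff is 0.6, and candidates below it are silently dropped rather than raising an error.

```python
import difflib

keys = ['one', 'two', 'three', 'four', 'five']

# a near miss clears the default 0.6 cutoff...
print(difflib.get_close_matches('fours', keys, n=1))   # ['four']

# ...but an unrelated string returns an empty list, not an error
print(difflib.get_close_matches('a very different string', keys))  # []
```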

yatu
  • I'd suggest using `apply` to make it faster: `df_other[left_on] = df_other[right_on].apply(lambda x: get_closest_match(x, df1[left_on], cutoff))` – irene Jun 12 '20 at 17:03
  • apply isn't faster than list comps @irene :) check https://stackoverflow.com/questions/16476924/how-to-iterate-over-rows-in-a-dataframe-in-pandas/55557758#55557758 – yatu Jun 12 '20 at 17:07
  • Hmm... I just tried the same code, it was visibly faster for the data that I had. Maybe it's data-dependent? – irene Jun 12 '20 at 17:13
  • Normally for reliable timings you need benchmarking on large sample sizes. But in my experience, list-comps are usually as fast or faster @irene. Also do note that apply is basically just looping over the rows too – yatu Jun 12 '20 at 17:14
  • Got it, will try list comprehensions next time `apply` is too slow for me. Thanks! – irene Jun 15 '20 at 11:13
7

http://pandas.pydata.org/pandas-docs/dev/merging.html does not have a hook function to do this on the fly. Would be nice though...

I would just do it in a separate step: use difflib's get_close_matches to create a new column in one of the two dataframes, then merge/join on the fuzzy-matched column.
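That two-step approach (match first, then merge) can be sketched as follows, with made-up sample frames:

```python
import difflib

import pandas as pd

df1 = pd.DataFrame({'name': ['one', 'two', 'three'], 'number': [1, 2, 3]})
df2 = pd.DataFrame({'name': ['one', 'too', 'tree'], 'letter': ['a', 'b', 'c']})

# step 1: add a fuzzy-matched key column to df2
df2['name_matched'] = df2['name'].map(
    lambda x: next(iter(difflib.get_close_matches(x, df1['name'], n=1)), None))

# step 2: an ordinary merge on the new column
merged = df1.merge(df2, left_on='name', right_on='name_matched',
                   suffixes=('', '_df2'))
```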

jassinm
5

I used the fuzzymatcher package and it worked well for me. Visit this link for more details on it.

Use the command below to install:

pip install fuzzymatcher

Below is the sample code (already shown by RobinL above):

from fuzzymatcher import link_table, fuzzy_left_join

# Columns to match on from df_left
left_on = ["fname", "mname", "lname",  "dob"]

# Columns to match on from df_right
right_on = ["name", "middlename", "surname", "date"]

# The link table potentially contains several matches for each record
fuzzymatcher.link_table(df_left, df_right, left_on, right_on)

Errors you may get

  1. ZeroDivisionError: float division by zero ---> Refer to this link to resolve it
  2. OperationalError: No Such Module: fts4 --> download the sqlite3.dll from here and replace the DLL file in your python or anaconda DLLs folder.

Pros:

  1. Works fast. In my case, I compared a dataframe with 3,000 rows against another with 170,000 records. It also uses SQLite3 text search, so it's faster than many alternatives.
  2. Can check across multiple columns and 2 dataframes. In my case, I was looking for the closest match based on address and company name. Sometimes the company name might be the same, but the address is also worth checking.
  3. Gives you a score for all the closest matches for the same record; you choose the cutoff score.

Cons:

  1. Original package installation is buggy
  2. Requires C++ and Visual Studio installed too
  3. Won't work for 64-bit Anaconda/Python
reddy
  • Thanks reddy... currently running this on a dataset with 6000 rows matched up against a dataset with 3 million rows, and praying... Do you think this will run faster than fuzzywuzzy? – Parseltongue Mar 20 '20 at 01:42
  • Hi @Parseltongue: This data is huge in your case. I don't think any fuzzywuzzy seems to be efficient against more than a million, but you can definitely give it a try for this one. I ran 6000 rows against 0.8 million rows and it was pretty good. – reddy Apr 19 '20 at 19:42
  • How to install the sqlite module? Tried all possible options - still does not work :( – Anakin Skywalker Dec 06 '21 at 13:28
  • @AnakinSkywalker the sqlite module is built into Python, so you don't need to install it! – RobinL Dec 07 '21 at 13:44
  • @reddy I haven't been able to figure out the zero float division error. I am getting it after installing in colab with pip, could you please help me out? – Vikranth Apr 26 '22 at 08:09
5

There is a package called fuzzy_pandas that can use levenshtein, jaro, metaphone and bilenko methods, with some great examples here.

import pandas as pd
import fuzzy_pandas as fpd

df1 = pd.DataFrame({'Key':['Apple', 'Banana', 'Orange', 'Strawberry']})
df2 = pd.DataFrame({'Key':['Aple', 'Mango', 'Orag', 'Straw', 'Bannanna', 'Berry']})

results = fpd.fuzzy_merge(df1, df2,
            left_on='Key',
            right_on='Key',
            method='levenshtein',
            threshold=0.6)

results.head()

  Key    Key
0 Apple  Aple
1 Banana Bannanna
2 Orange Orag
camnesia
3

Using thefuzz

This uses SeatGeek's great package thefuzz, which makes use of Levenshtein distance. It works with data held in columns, adds matches as rows rather than columns (to preserve a tidy dataset), and allows additional columns to be easily pulled through to the output dataframe.


Sample data

df1 = pd.DataFrame({'col_a':['one','two','three','four','five'], 'col_b':[1, 2, 3, 4, 5]})

    col_a   col_b
0   one     1
1   two     2
2   three   3
3   four    4
4   five    5

df2 = pd.DataFrame({'col_a':['one','too','three','fours','five'], 'col_b':['a','b','c','d','e']})

    col_a   col_b
0   one     a
1   too     b
2   three   c
3   fours   d
4   five    e

Function used to do the matching

def fuzzy_match(
    df_left, df_right, column_left, column_right, threshold=90, limit=1
):
    # Create a series
    series_matches = df_left[column_left].apply(
        lambda x: process.extract(x, df_right[column_right], limit=limit)            # Creates a series with id from df_left and column name _column_left_, with _limit_ matches per item
    )

    # Convert matches to a tidy dataframe
    df_matches = series_matches.to_frame()
    df_matches = df_matches.explode(column_left)     # Convert list of matches to rows
    df_matches[
        ['match_string', 'match_score', 'df_right_id']
    ] = pd.DataFrame(df_matches[column_left].tolist(), index=df_matches.index)       # Convert match tuple to columns
    df_matches.drop(column_left, axis=1, inplace=True)      # Drop column of match tuples

    # Reset index, as in creating a tidy dataframe we've introduced multiple rows per id, so that no longer functions well as the index
    if df_matches.index.name:
        index_name = df_matches.index.name     # Stash index name
    else:
        index_name = 'index'        # Default used by pandas
    df_matches.reset_index(inplace=True)
    df_matches.rename(columns={index_name: 'df_left_id'}, inplace=True)       # The previous index has now become a column: rename for ease of reference

    # Drop matches below threshold
    df_matches.drop(
        df_matches.loc[df_matches['match_score'] < threshold].index,
        inplace=True
    )

    return df_matches

Use function and merge data

import pandas as pd
from thefuzz import process

df_matches = fuzzy_match(
    df1,
    df2,
    'col_a',
    'col_a',
    threshold=60,
    limit=1
)

df_output = df1.merge(
    df_matches,
    how='left',
    left_index=True,
    right_on='df_left_id'
).merge(
    df2,
    how='left',
    left_on='df_right_id',
    right_index=True,
    suffixes=['_df1', '_df2']
)

df_output.set_index('df_left_id', inplace=True)       # For some reason the first merge operation wrecks the dataframe's index. Recreated from the value we have in the matches lookup table

df_output = df_output[['col_a_df1', 'col_b_df1', 'col_b_df2']]      # Drop columns used in the matching
df_output.index.name = 'id'

id  col_a_df1   col_b_df1   col_b_df2
0   one         1           a
1   two         2           b
2   three       3           c
3   four        4           d
4   five        5           e

Tip: Fuzzy matching using thefuzz is much quicker if you optionally install the python-Levenshtein package too.

philipnye
2

As a heads up, this basically works, except if no match is found, or if you have NaNs in either column. Instead of directly applying get_close_matches, I found it easier to apply the following function. The choice of NaN replacements will depend a lot on your dataset.

import difflib

import numpy as np
import pandas as pd

def fuzzy_match(a, b):
    left = '1' if pd.isnull(a) else a
    right = b.fillna('2')
    out = difflib.get_close_matches(left, right)
    return out[0] if out else np.nan
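For instance, applied to columns containing NaNs (a standalone sketch with made-up frames; the function is repeated here so the snippet runs on its own, and without the NaN guards difflib would raise a TypeError on the missing values):

```python
import difflib

import numpy as np
import pandas as pd

def fuzzy_match(a, b):
    left = '1' if pd.isnull(a) else a   # placeholder for NaN on the left
    right = b.fillna('2')               # placeholder for NaNs on the right
    out = difflib.get_close_matches(left, right)
    return out[0] if out else np.nan

df1 = pd.DataFrame({'key': ['one', np.nan, 'tow']})
df2 = pd.DataFrame({'key': ['one', 'two', np.nan]})

# row-wise match of df1's keys against df2's keys
df1['match'] = df1['key'].apply(fuzzy_match, b=df2['key'])
```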
Luke
2

You can use d6tjoin for that

import d6tjoin.top1
d6tjoin.top1.MergeTop1(df1.reset_index(),df2.reset_index(),
       fuzzy_left_on=['index'],fuzzy_right_on=['index']).merge()['merged']

   index  number index_right letter
0    one       1         one      a
1    two       2         too      b
2  three       3       three      c
3   four       4       fours      d
4   five       5        five      e

It has a variety of additional features such as:

  • check join quality, pre and post join
  • customize similarity function, eg edit distance vs hamming distance
  • specify max distance
  • multi-core compute

For details, see the d6tjoin repository: https://github.com/d6t/d6tjoin

citynorman
  • Just tested this, it gives me weird results back, for example it matched `government` with `business`, is there a way to configure the threshold for the matching score? – Erfan May 26 '19 at 16:28
  • Yes see [reference docs](https://d6tjoin.readthedocs.io/en/latest/d6tjoin.html#module-d6tjoin.top1) you can pass `top_limit` and might also want to change `fun_diff` to `fun_diff=[affinegap.affineGapDistance]` which tends to give better matches. – citynorman May 29 '19 at 03:16
2

I have used fuzzywuzzy in a very minimal way, while matching the existing behaviour and keywords of merge in pandas.

Just specify your accepted threshold for matching (between 0 and 100):

from fuzzywuzzy import process

def fuzzy_merge(df, df2, on=None, left_on=None, right_on=None, how='inner', threshold=80):
    
    def fuzzy_apply(x, df, column, threshold=threshold):
        if type(x)!=str:
            return None
        
        match, score, *_ = process.extract(x, df[column], limit=1)[0]
            
        if score >= threshold:
            return match

        else:
            return None
    
    if on is not None:
        left_on = on
        right_on = on

    # create temp column as the best fuzzy match (or None!)
    df2['tmp'] = df2[right_on].apply(
        fuzzy_apply, 
        df=df, 
        column=left_on, 
        threshold=threshold
    )

    merged_df = df.merge(df2, how=how, left_on=left_on, right_on='tmp')
    
    del merged_df['tmp']
    
    return merged_df

Try it out using the example data:

df1 = pd.DataFrame({'Key':['Apple', 'Banana', 'Orange', 'Strawberry']})

df2 = pd.DataFrame({'Key':['Aple', 'Mango', 'Orag', 'Straw', 'Bannanna', 'Berry']})

fuzzy_merge(df1, df2, on='Key', threshold=80)
Matt
  • Instead of process.extract with a limit of 1, you can directly use process.extractOne, which only extracts the best match. – maxbachmann Mar 27 '21 at 21:20
0

For more complex use cases, to match rows with many columns you can use the recordlinkage package. recordlinkage provides all the tools to fuzzy match rows between pandas dataframes, which helps to deduplicate your data when merging. I have written a detailed article about the package here

Bex T.
0

If the join axis is numeric, this could also be used to match indexes with a specified tolerance:

import numpy as np
import pandas as pd

def fuzzy_left_join(df1, df2, tol=None):
    index1 = df1.index.values
    index2 = df2.index.values

    diff = np.abs(index1.reshape((-1, 1)) - index2)
    mask_j = np.argmin(diff, axis=1)  # closest index2 entry for each index1 entry
    mask_i = np.arange(mask_j.shape[0])

    df1_ = df1.iloc[mask_i]
    df2_ = df2.iloc[mask_j]

    if tol is not None:
        mask = np.abs(df2_.index.values - df1_.index.values) <= tol
        df1_ = df1_.loc[mask]
        df2_ = df2_.loc[mask]

    df2_.index = df1_.index

    out = pd.concat([df1_, df2_], axis=1)
    return out
mccandar
0

thefuzz is the new version of fuzzywuzzy.

In order to fuzzy-join string-elements in two big tables you can do this:

  1. Use apply to go row by row
  2. Use swifter to parallel, speed up and visualize default apply function (with colored progress bar)
  3. Use OrderedDict from collections to get rid of duplicates in the output of merge and keep the initial order
  4. Increase limit in thefuzz.process.extract to see more options for merge (stored in a list of tuples with % of similarity)

'*' You can use thefuzz.process.extractOne instead of thefuzz.process.extract to return just one best-matched item (without specifying any limit). However, be aware that several results could have the same % of similarity and you will get only one of them.

'**' Somehow the swifter takes a minute or two before starting the actual apply. If you need to process small tables you can skip this step and just use progress_apply instead

from thefuzz import process
from collections import OrderedDict
import swifter    


def match(x):
    # df1 here is the array of candidate names defined below
    matches = process.extract(x, df1, limit=6)
    # drop duplicate matches while keeping their order
    matches = list(OrderedDict((m, True) for m in matches).keys())
    print(f'{x:20} : {matches}')

    return str(matches)


# df and df2 are assumed to be your existing dataframes
df1 = df['name'].values
df2['matches'] = df2['name'].swifter.apply(lambda x: match(x))
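The OrderedDict de-duplication in step 3 is independent of thefuzz and works on any list of (match, score) tuples, keeping first-seen order (since Python 3.7 a plain dict would behave the same way):

```python
from collections import OrderedDict

# a made-up extract() result with duplicates
matches = [('two', 90), ('two', 90), ('too', 75), ('two', 90)]

# dict keys are unique, and OrderedDict preserves insertion order
deduped = list(OrderedDict((m, True) for m in matches).keys())
# → [('two', 90), ('too', 75)]
```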
John Zwinck
George
0

I've found this very efficient. The details in the description of the function:

from fuzzywuzzy.process import extract

def efficient_matching(df1,
                       col1,
                       df2,
                       col2,
                       limit=3,
                       length_diff=3,
                       first_letter_match=2
                       ):
    """
    For each name that we want to find matches for, it's more efficient to only look at a subset of potential matches. 
    One way to narrow down all the matches to potential matches is length. Here are 2 methods:
    1. If the name is "Markos", we don't need to check how similar markos is to names with length less than 4 or 
    more than 8. This window is determined by length_diff.
    2. We consider names from the corpus whose first 2 letters are similar the first letters of the name we want to find
    the match for.

    limit: Gives how many closest matches to return.
    """
    df1[col1] = df1[col1].astype(str)
    df2[col2] = df2[col2].astype(str)

    df1['_len_'] = df1[col1].apply(len)
    df2['_len_'] = df2[col2].apply(len)
    df2 = df2[df2['_len_'] >= 2]

    matches = df1[[col1, '_len_']].apply(lambda x: 
                                        extract(x[0], 
                                                df2[
                                                    ((df2['_len_'] - x[1]).abs() < length_diff) &
                                                    (df2[col2].str[:first_letter_match]==x[0][:first_letter_match])
                                                    ][col2].tolist(), 
                                                limit = limit
                                                ), 
                                        axis=1
                                        ) 
    return matches
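The same blocking idea (pre-filter candidates by length and prefix before scoring) also works with stdlib difflib; a hedged sketch with made-up names, for illustration only:

```python
import difflib

import pandas as pd

names = pd.Series(['Markos', 'Marcus', 'Annabel', 'Bob'])
corpus = pd.Series(['Markus', 'Anabel', 'Robert', 'Ann'])

def blocked_match(name, corpus, length_diff=3, first_letter_match=2):
    # only score candidates of similar length that share the first letters
    candidates = corpus[
        ((corpus.str.len() - len(name)).abs() < length_diff)
        & (corpus.str[:first_letter_match] == name[:first_letter_match])
    ]
    best = difflib.get_close_matches(name, candidates, n=1)
    return best[0] if best else None

matches = names.map(lambda n: blocked_match(n, corpus))
```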
Mehdi Rostami