2

I have referred to this post but cannot get it to run for my particular case. I have two dataframes:

import pandas as pd

df1 = pd.DataFrame(
    {
        "ein": {0: 1001, 1: 1500, 2: 3000},
        "ein_name": {0: "H for Humanity", 1: "Labor Union", 2: "Something something"},
        "lname": {0: "Cooper", 1: "Cruise", 2: "Pitt"},
        "fname": {0: "Bradley", 1: "Thomas", 2: "Brad"},
    }
)

df2 = pd.DataFrame(
    {
        "lname": {0: "Couper", 1: "Cruise", 2: "Pit"},
        "fname": {0: "Brad", 1: "Tom", 2: "Brad"},
        "score": {0: 3, 1: 3.5, 2: 4},
    }
)

Then I do:

from fuzzywuzzy import fuzz
from fuzzywuzzy import process
from itertools import product

N = 60
names = {
    tup: fuzz.ratio(*tup)
    for tup in product(df1["lname"].tolist(), df2["lname"].tolist())
}

s1 = pd.Series(names)
s1 = s1[s1 > N]
s1 = s1[s1.groupby(level=0).idxmax()]

degrees = {
    tup: fuzz.ratio(*tup)
    for tup in product(df1["fname"].tolist(), df2["fname"].tolist())
}

s2 = pd.Series(degrees)
s2 = s2[s2 > N]
s2 = s2[s2.groupby(level=0).idxmax()]

df2["lname"] = df2["lname"].map(s1).fillna(df2["lname"])
df2["fname"] = df2["fname"].map(s2).fillna(df2["fname"])
df = df1.merge(df2, on=["lname", "fname"], how="outer")

The result is not what I expect. Can you help me with editing this code please? Note that I have millions of lines in df1 and millions in df2, so I need some efficiency as well.

Basically, I need to match people from df1 to people in df2. In the above example, I am matching them on last name (lname) and first name (fname). I also have a third one, which I leave out here for parsimony.

The expected result should look like:

ein ein_name    lname   fname   score
0   1001    H for Humanity  Cooper  Bradley 3
1   1500    Labor Union Cruise  Thomas  3.5
2   3000    Something something Pitt    Brad    4

1 Answers1

0

You could try this:

from functools import cache

import pandas as pd
from fuzzywuzzy import fuzz

# First, define indices and values to check for matches
indices_and_values = [(i, value) for i, value in enumerate(df2["lname"] + df2["fname"])]

# Define helper functions to find matching rows and get corresponding score
@cache
def find_match(x):
    return [i for i, value in indices_and_values if fuzz.ratio(x, value) > 75]

def get_score(x):
    try:
        return df2.loc[x[0], "score"]
    except (KeyError, IndexError):
        return pd.NA

# Add scores to df1:
df1["score"] = (
    (df1["lname"] + df1["fname"])
    .apply(find_match)
    .apply(get_score)
)

And then:

print(df1)

    ein             ein_name   lname    fname  score
0  1001       H for Humanity  Cooper  Bradley    3.0
1  1500          Labor Union  Cruise   Thomas    3.5
2  3000  Something something    Pitt     Brad    4.0

Given the size of your dataframes, I suppose you have namesakes (identical first and last names), hence the use of @cache decorator from Python standard library in order to try speeding things up (but you can do without it).

Laurent
  • 12,287
  • 7
  • 21
  • 37
  • Thank you Laurent. My df2 has about 15m lines, and df1 has about 1m. Pardon my total ignorance, but is the above optimal for such large data? – Umar Boodoo Dec 26 '21 at 10:56
  • To improve performance, I've revised my answer in order to get rid of intermediate assignment (chaining method calls), calculate reference values once and not at each iteration, and cache previous results. As far as Pandas is concerned, that's the most idiomatic way I know to do it. – Laurent Dec 26 '21 at 15:59
  • Hello @laurent. how would you change the change to achieve two things: 1. Get the fuzz ratio for fname and the fuzz ratio for lname in respective columns added to the database df1 2. Have distinct fuzz ratios for fname and lname, e.g. fuzz_ratio >60 for fname, and fuzz_ratio >75 for lname – Umar Boodoo Mar 02 '22 at 13:24
  • Hi @UmarBoodoo, these are quite different problems. As per Stack Overflow policy, please consider (1) accepting this answer as it solves the problem stated in your post and (2) posting a new, separate question. Cheers. – Laurent Mar 02 '22 at 18:52
  • Thanks @laurent. Answer accepted and new question created. – Umar Boodoo Mar 03 '22 at 10:02