
I am trying to implement a kind of join on 2 DataFrames in Python 3.5 (pandas). The only difference from a normal join is that it should return only the first match for each input row as the output and ignore the rest of the matching records.

E.g. below -

Input data frame -

ID | Name
1 | Sanket
1 | Sanket
1 | Sanket
2 | Robin
2 | Robin
3 | Vishal
3 | Vishal  

Lookup data frame -

ID | Name
1 | Sanket
2 | Robin
3 | Vishal
4 | Umesh
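
For reproducibility, the two frames can be created with something like this (df_input and df_lookup are just illustrative names):

import pandas as pd

df_input = pd.DataFrame({
    "ID":   [1, 1, 1, 2, 2, 3, 3],
    "Name": ["Sanket", "Sanket", "Sanket", "Robin", "Robin", "Vishal", "Vishal"],
})

df_lookup = pd.DataFrame({
    "ID":   [1, 2, 3, 4],
    "Name": ["Sanket", "Robin", "Vishal", "Umesh"],
})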

My join condition is Lookup.Name > Input.Name

So for Input.ID == 1, there are 2 matches in the lookup, "Vishal" and "Umesh"; it should return only the first match, i.e. "Vishal", and ignore the rest of the matching records.

For Input.ID == 2, there are 3 matches, "Sanket", "Vishal" and "Umesh"; it should return only the first match, i.e. "Sanket", and ignore the rest.

The output should be:

Input.ID | Input.Name | Lookup.ID | Lookup.Name
1 | Sanket | 3 | Vishal
1 | Sanket | 3 | Vishal
1 | Sanket | 3 | Vishal
2 | Robin | 1 | Sanket
2 | Robin | 1 | Sanket
3 | Vishal | NaN | NaN
3 | Vishal | NaN | NaN

I have mainly tried the 2 approaches below, and can provide the code if needed:

  1. pandas.DataFrame.merge - here I cannot use a ">" condition, because merge only supports equality ("=") conditions between columns. With an "=" condition I do get the expected output.
  2. Writing an SQL query using pandasql with a nested join, but it doesn't produce the expected output. Here is the query I am executing:
select IP.*, Lookup.[ID] as [lookup_ID],Lookup.[Name] as [lookup_Name]
from InputDF IP
LEFT join
(
select  Lookup.* from InputDF IP LEFT join LookupDF Lookup on [Lookup].[Name] > [IP].[Name] limit 1
)
Lookup
on [Lookup].[Name] > [IP].[Name]

In this query, because of the "limit 1" clause, it returns "Vishal" as the Lookup.Name for every input row.

To summarize: I am trying to get this to work like an Excel VLOOKUP, but with a ">" condition instead of "=".

Can someone please help and suggest some approach to get this working?

Thank you! Sanket Kelkar


1 Answer


Here's an approach you can try.

First, perform a cross join, or cartesian product, on the two dataframes

import pandas as pd


def cartesian_product_basic(left, right):
    # Cross join: add a constant key to both frames, merge on it, drop it again
    return left.assign(key=1).merge(right.assign(key=1), on='key').drop(columns='key')


df_combined = cartesian_product_basic(df_input, df_lookup)
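
On newer pandas versions (1.2+, which may not be available on Python 3.5), the helper isn't strictly needed: merge supports a cross join directly. A minimal equivalent:

# Equivalent cross join on pandas >= 1.2
df_combined = df_input.merge(df_lookup, how="cross")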

Rename the columns

df_combined.columns = ["input_id", "input_name", "lookup_id", "lookup_name"]

Pair the lookup rows with the input rows: keep only the rows that satisfy the join condition, then pick the first match for each input row using a dense rank on lookup_id.

# Keep only the rows that satisfy the join condition, then rank the matches
# for each input row by lookup_id (dense rank, so 1 = first lookup match)
df_combined = df_combined.loc[df_combined["lookup_name"] > df_combined["input_name"]].copy()
df_combined["row_rank"] = (
    df_combined.groupby(["input_id", "input_name"])['lookup_id'].rank(method='dense')
)
df_combined = df_combined.loc[df_combined["row_rank"] == 1]

Add back the input rows that have no greater name in the lookup (like "Vishal" in this case); a left merge with indicator=True finds them.

# Left-merge the original input against the matched rows; rows that appear
# only on the left side had no lookup name greater than their own
df_missing = df_input.merge(
    df_combined,
    how='left',
    left_on=['ID'],
    right_on=["input_id"],
    indicator=True
)

# Unmatched input rows get NaN in the lookup columns when concatenated
df_to_append = df_missing.loc[df_missing["_merge"] == "left_only", ["ID", "Name"]]
df_to_append.columns = ["input_id", "input_name"]
df_output = pd.concat([df_combined, df_to_append])

Clean up output dataframe

df_output.drop(columns="row_rank", inplace=True)
df_output.reset_index(drop=True, inplace=True)
print(df_output)

Output

   input_id input_name  lookup_id lookup_name
0         1     Sanket        3.0      Vishal
1         1     Sanket        3.0      Vishal
2         1     Sanket        3.0      Vishal
3         2      Robin        1.0      Sanket
4         2      Robin        1.0      Sanket
5         3     Vishal        NaN         NaN
6         3     Vishal        NaN         NaN

Note: lookup_id column displays as float because of the presence of NaN values.
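
If you would rather keep lookup_id integer-like despite the missing matches, one option (assuming a pandas version with nullable integer dtypes, 0.24+) is to cast it to the nullable Int64 dtype, which shows missing values as <NA>:

# Optional: avoid the float display caused by NaN values
df_output["lookup_id"] = df_output["lookup_id"].astype("Int64")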

Kurt Kline
  • The cartesian product function used here is from a separate StackOverflow post: https://stackoverflow.com/a/53699013/8785917 – Kurt Kline Oct 28 '20 at 10:07