I am trying to implement a kind of join functionality on 2 DataFrames in Python3.5, the only difference in logic is, it should provide only first match as the output and ignore the rest of the matching records.
E.g. below -
Input data frame -
ID | Name
1 | Sanket
1 | Sanket
1 | Sanket
2 | Robin
2 | Robin
3 | Vishal
3 | Vishal
Lookup data frame -
ID | Name
1 | Sanket
2 | Robin
3 | Vishal
4 | Umesh
My Join condition is Lookup.name > Input.name
So for Input.ID == 1, there are 2 matches in the lookup "Vishal" and "Umesh", it should provide only the first match i.e. "Vishal" and ignore the rest of the matching records.
for Input.ID == 2, there are 3 matches "Sanket", "Vishal" and "Umesh", it should only provide the first match i.e. "Sanket" and ignore the rest
The output should be-
Input.ID | Input.Name | Lookup.ID | Lookup.Name
1 | Sanket | 3 | Vishal
1 | Sanket | 3 | Vishal
1 | Sanket | 3 | Vishal
2 | Robin | 1 | Sanket
2 | Robin | 1 | Sanket
3 | Vishal | NaN | NaN
3 | Vishal | NaN | NaN
I have majorly tried below 2 things and can provide the code if needed -
- pandas.DataFrame.merge = Here I can not use ">" condition as it only supports "=" condition on 2 columns. With the "=" condition, I am getting the expected output
- Writing a SQL query using pandasql with some nested joins and logic but it doesn't provide the expected output. here's the query that I am executing -
select IP.*, Lookup.[ID] as [lookup_ID],Lookup.[Name] as [lookup_Name]
from InputDF IP
LEFT join
(
select Lookup.* from InputDF IP LEFT join LookupDF Lookup on [Lookup].[Name] > [IP].[Name] limit 1
)
Lookup
on [Lookup].[Name] > [IP].[Name]
In the query, because of the "limit1" clause, it gives "Vishal" as a lookup.Name for all the input rows.
To summarize - I am trying to get it worked as an excel Vlookup function, only with the ">" condition.
Can someone please help and suggest some approach to get this working?
Thank you! Sanket Kelkar