Pandas Left Join results in more rows than the Left Dataframe

Question

I have more rows in the resultant left join than are in the left dataframe.

# Importing Pandas and changing it's call to pd
import numpy as np
import pandas as pd

SalesDF = pd.read_csv(r"C:\Users\USER\Documents\Reports\SalesForAnalysis.csv")
print("This is the Sales shape")
print(SalesDF.shape)


CustInfoDF = pd.read_csv(r"C:\Users\USER\Documents\Cust.csv")

# This reassigns the df so that the rows with a NaN in the Account Number it  doesn't appear
CustInfoDF = CustInfoDF[CustInfoDF['Account Number'].notna()]


# Merges the two dataframes on SalesDF with "Cust Number" as the key
MergeDF = pd.merge(SalesDF, CustInfoDF, how="left", left_on="Cust Number", right_on="Account Number")

print("This is the Merge Shape ")
print(MergeDF.shape)

# Reduced the number of columns to the selected columns
CutDF = MergeDF[["Customer", "Invoice #", "E-mail Address", "Phone", "Clerk", "Total", "Date"]]

CutDF.drop_duplicates()

print("This is the Cut shape ")
print(CutDF.shape)

Here is the result after running the program

This is the Sales shape
(5347, 61)
This is the Merge Shape 
(6428, 83)
This is the Cut shape 
(6428, 7)

Process finished with exit code 0

The CutDF should only have a max of 5347 rows. I have a drop_duplicates method in there but I still get the same results.

I saw this pandas left join - why more results? and inner join/merge in pandas dataframe give more rows than left dataframe but I didn't really see a solution to this in these though.

Any help would be appreciated.

I am afraid that I do not understand the question. Could you rephrase please? — BirdBud, Dec 11 '20 at 17:59

Parmandeep Chaddha · Accepted Answer · 2021-04-14T16:58:13.853

4

Before you execute:

MergeDF = pd.merge(SalesDF, CustInfoDF, how="left", left_on="Cust Number", right_on="Account Number")

Can you do:

CustInfoDF = CustInfoDF.drop_duplicates(subset=["Account Number"])

I have a suspicion that your CustInforDF has multiple entries for each Account Number.

If this doesn't work, can you please post the sample data frames? Feel free to add/replace with dummy values as long as the code is replicable.

edited Apr 14 '21 at 16:58

answered Dec 10 '20 at 23:19

Parmandeep Chaddha

454
2
9

You are correct in your assumption that there are redundant account numbers. I will give a more earnest effort to implement this when I have some time. – BirdBud Dec 11 '20 at 17:57
My new resultant message after adding ```CustInfoDF = CustInfoDF.drop_duplicates(on=["Account Number"])``` before my MergeDF is formed ``` This is the Sales shape (5104, 62) Traceback (most recent call last): File "C:\Users\USER\PycharmProjects\PandasStuff\PyMailReport.py", line 18, in CustInfoDF = CustInfoDF.drop_duplicates(on=["Account Number"]) TypeError: drop_duplicates() got an unexpected keyword argument 'on' Process finished with exit code 1``` – BirdBud Dec 11 '20 at 18:22
Final Comment Thank you Parmandeep, The issue is resolved. the "on=" wasn't necessary in the drop_duplicates code and that is what was throwing the program but now it seems to be working perfectly. Thank you very much – BirdBud Dec 11 '20 at 18:38
Ah yes, sorry the `on` should be `subset`. Just updating in case people use this in the future. – Parmandeep Chaddha Apr 14 '21 at 16:58

Pandas Left Join results in more rows than the Left Dataframe

1 Answers1