Join two large CSV's without duplicating in Python Pandas (or similar), much like using VLOOKUP on just the first dataframe

Question

I have a data set (Data Set 1) of 3425 lines long, it has approximately 600 "Part Numbers" that are unique. Data Set 2 has a list of all of these part numbers, and more (as some aren't present in data set 1), with some corresponding run-time data. These do not repeat.

No matter what method I choose, I cannot get it to not duplicate in some way. All I want it to do, is look at line 1 in Data Set 1, and find the 3 columns of data in Data Set 2, and add it as 3 more columns to the right.

For example (This is a very basic idea)

Data Set 1:

Part Number Quantity  Person
aaa            1        JK
bbb            2        PM
ccc            1        BJ
ddd            3        LL
aaa            999      ZZ

Data Set 2:

Part Number Typical Material Cats/Dogs
aaa            Nylon             Cat
bbb            Cheese            Dog
ccc            Titanium          Cat
ddd            Cardboard         Dog
eee            Mouse             Cat

Result:

Part Number Quantity  Person Typical Material Cats/Dogs
aaa            1        JK      Nylon             Cat
bbb            2        PM      Cheese            Dog
ccc            1        BJ      Titanium          Cat
ddd            3        LL      Cardboard         Dog
aaa            999      ZZ      Nylon             Cat

So it can completely ignore eee as it was not required in Data Set 1, and appends the data for what I have.

I have tried numerous things to get it to work, and more that aren't below as they've been deleted. All variations of merge and concat, as well as update() and some others I've forgotten. I've tried some loops, and searched stack overflow, google, etc. All have similar ideas, but nothing actually works.

pulsesCSV = pd.read_csv("C:\location")
#pulsesCSV.set_index('Part Number') - Used for 
rawDataCSV = pd.read_csv("C:\location")
#rawDataCSV.set_index('Part Number')

#df = rawDataCSV.merge(pulsesCSV, on='Part Number')
#df = pd.DataFrame(df[df.index_x==df.index_y]['Part Number'], columns=['Part Number']).reset_index(drop=True)


# Join the tables on the part number
#jointTable = pd.merge(pulsesCSV,rawDataCSV,on='Part Number')
#jointTable = pd.merge(rawDataCSV,pulsesCSV,on='Part Number',how='outer')
#jointTable = pd.concat([pulsesCSV,rawDataCSV],axis=1,join='inner')
#jointTable = rawDataCSV.combine_first(pulsesCSV)
#jointTable = pulsesCSV.combine_first(rawDataCSV)
#jointTable = rawDataCSV.join(pulsesCSV,on='Part Number',how='inner')



#export_csv = jointTable.to_csv(r"")

This is typical `Merge` problem as tagged by @Datanovice, the reason you are getting duplicates, is because you have duplicate `Part Numbers`. If you want to understand how merging works better. Google about `joining data` and understand the concept of `key column` and the types of joins: `inner, left, right, etc` — Erfan, May 08 '19 at 14:48

score 0 · Accepted Answer · answered May 08 '19 at 14:26

0

Assuming your dataframes are called df and df1

res = pd.merge(df,df1,on='Part',how='left)

doing a left join will only keep the keys from the left frame.

answered May 08 '19 at 14:26

Umar.H

22,559
7
39
74

I have tried this, and it increases my row count from 3425 to 3863. – Voodoo Sciene May 08 '19 at 14:43
do you have duplicate entries in your right df ? – Umar.H May 08 '19 at 14:46
I was about to reply no, as I was certain I didn't, but checked, and turned out I did in the right df. That was the issue. Incredibly stupid oversight. The aforementioned left join worked fine now with **NO** duplicates in the right data set! – Voodoo Sciene May 09 '19 at 07:06
Best of luck mate! joins took me a while to figure out, use the above link by Mr coldspeed for a thorough intro or check out sorcatica on YouTube – Umar.H May 09 '19 at 07:15

Join two large CSV's without duplicating in Python Pandas (or similar), much like using VLOOKUP on just the first dataframe

1 Answers1