Combine datasets - different sizes and need one to repeat based on ID in python

Question

I have two datasets I need to combine in pandas. One includes information for a bunch of invoices and two includes supplier information. I've included a simplified example below.

Basically, I want to combine the supplier data with the invoice data on supplier ID so that each invoice has the home country of the supplier from the supplier info data.

I've tried merging and joining in different ways (inner, left, outer) and it did not repeat based on duplicate supplier IDs like I needed it to. I'm having trouble figuring out how to repeat observations. Example, Company A may appear in 2+ invoices so we need country to populate on each, but with merging it only will populate on one. The rest will be NaN. Can anyone help?

Data example:

d1 = {'company id': [11, 12, 11, 11, 13, 12, 14], 
     'company': ["Company A", "Company B", "Company A", "Company A", "Company C", "Company B", "Company D"], 
    'USD': [120000, 20000, 4000000, 99000, 8000, 110000, 443000], 'Invoice no': [1, 2, 3, 4, 5, 6, 7]} 
invoices = pd.DataFrame(data=d1)

d2 = {'company id': [11, 12, 13, 14], 
     'company': ["Company A", "Company B", "Company C", "Company D"], 
    'Country': ["USA", "USA", "AUS", "MEX"]} 
supplier = pd.DataFrame(data=d2)

`invoices.merge(supplier.drop('company',1),on='company id')` ? — anky, Mar 06 '19 at 16:32
What does your expected output dataframe look like for this data? — Scott Boston, Mar 06 '19 at 16:37

score 1 · Accepted Answer · answered Mar 06 '19 at 16:39

Use map with pd.Series:

invoices['Country'] = invoices['company id'].map(supplier.set_index('company id')['Country'])

invoices

Output:

   company id    company      USD  Invoice no Country
0          11  Company A   120000           1     USA
1          12  Company B    20000           2     USA
2          11  Company A  4000000           3     USA
3          11  Company A    99000           4     USA
4          13  Company C     8000           5     AUS
5          12  Company B   110000           6     USA
6          14  Company D   443000           7     MEX

Details, first create a series that maps company id to country with company id in the index and country as values of the pd.Series Then use pd.Series.map to make every value of company id in the invoices dataframe to a country provided by the pd.Series defined.

What is the benefit of this over `merge`? Our output is exactly the same — Erfan, Mar 06 '19 at 16:53
Speed! This only getting a single column of the original datafame. With merge you working with many columns and etc... if you are only "merging" a single column, many times map is more efficient. — Scott Boston, Mar 06 '19 at 16:55

score 0 · Answer 2 · answered Mar 06 '19 at 16:33

Not sure where you get NaN but a merge works for me.

merged = pd.merge(invoices, supplier, on=['company id', 'company'], how='left')
print(merged)
   company id    company      USD  Invoice no Country
0          11  Company A   120000           1     USA
1          12  Company B    20000           2     USA
2          11  Company A  4000000           3     USA
3          11  Company A    99000           4     USA
4          13  Company C     8000           5     AUS
5          12  Company B   110000           6     USA
6          14  Company D   443000           7     MEX

Combine datasets - different sizes and need one to repeat based on ID in python

2 Answers2