0

I have two datasets I need to combine in pandas. One includes information for a bunch of invoices and two includes supplier information. I've included a simplified example below.

Basically, I want to combine the supplier data with the invoice data on supplier ID so that each invoice has the home country of the supplier from the supplier info data.

I've tried merging and joining in different ways (inner, left, outer) and it did not repeat based on duplicate supplier IDs like I needed it to. I'm having trouble figuring out how to repeat observations. Example, Company A may appear in 2+ invoices so we need country to populate on each, but with merging it only will populate on one. The rest will be NaN. Can anyone help?

Data example:

d1 = {'company id': [11, 12, 11, 11, 13, 12, 14], 
     'company': ["Company A", "Company B", "Company A", "Company A", "Company C", "Company B", "Company D"], 
    'USD': [120000, 20000, 4000000, 99000, 8000, 110000, 443000], 'Invoice no': [1, 2, 3, 4, 5, 6, 7]} 
invoices = pd.DataFrame(data=d1)

d2 = {'company id': [11, 12, 13, 14], 
     'company': ["Company A", "Company B", "Company C", "Company D"], 
    'Country': ["USA", "USA", "AUS", "MEX"]} 
supplier = pd.DataFrame(data=d2)
Melissa M
  • 3
  • 2

2 Answers2

1

Use map with pd.Series:

invoices['Country'] = invoices['company id'].map(supplier.set_index('company id')['Country'])

invoices

Output:

   company id    company      USD  Invoice no Country
0          11  Company A   120000           1     USA
1          12  Company B    20000           2     USA
2          11  Company A  4000000           3     USA
3          11  Company A    99000           4     USA
4          13  Company C     8000           5     AUS
5          12  Company B   110000           6     USA
6          14  Company D   443000           7     MEX

Details, first create a series that maps company id to country with company id in the index and country as values of the pd.Series Then use pd.Series.map to make every value of company id in the invoices dataframe to a country provided by the pd.Series defined.

Scott Boston
  • 147,308
  • 15
  • 139
  • 187
  • What is the benefit of this over `merge`? Our output is exactly the same – Erfan Mar 06 '19 at 16:53
  • Speed! This only getting a single column of the original datafame. With merge you working with many columns and etc... if you are only "merging" a single column, many times map is more efficient. – Scott Boston Mar 06 '19 at 16:55
0

Not sure where you get NaN but a merge works for me.

merged = pd.merge(invoices, supplier, on=['company id', 'company'], how='left')
print(merged)
   company id    company      USD  Invoice no Country
0          11  Company A   120000           1     USA
1          12  Company B    20000           2     USA
2          11  Company A  4000000           3     USA
3          11  Company A    99000           4     USA
4          13  Company C     8000           5     AUS
5          12  Company B   110000           6     USA
6          14  Company D   443000           7     MEX
Erfan
  • 40,971
  • 8
  • 66
  • 78