I have the following Pandas data frame in 'tidy' (long) format:
df = pd.DataFrame({'Manufacturer':['Ford', 'Mercedes', 'BMW', 'Ford', 'Mercedes', 'BMW', 'Ford', 'Mercedes', 'BMW', 'Ford', 'Mercedes', 'BMW', 'Ford', 'Mercedes', 'BMW', 'Ford', 'Mercedes', 'BMW'],
'Metric':['Orders', 'Orders', 'Orders', 'Orders', 'Orders', 'Orders', 'Sales', 'Sales', 'Sales', 'Sales', 'Sales', 'Sales', 'Warranty', 'Warranty', 'Warranty', 'Warranty', 'Warranty', 'Warranty'],
'Sector':['Germany', 'Germany', 'Germany', 'USA', 'USA', 'USA', 'Germany', 'Germany', 'Germany', 'USA', 'USA', 'USA', 'Germany', 'Germany', 'Germany', 'USA', 'USA', 'USA'],
'Value':[45000, 70000, 65000, 40000, 65000, 63000, 2700, 4400, 3400, 3000, 4700, 5700, 1500, 2000, 2500, 1300, 2000, 2450],
})
I'm trying to compute the ratio between Orders to Sales, broken down by Manufacturer and Sector.
I've tried a few approaches (e.g. groupby, transform, etc.) based on the answers provided in Pandas percentage of total with groupby, but was unsuccessful with any of these (likely due to the data being structured a bit differently, and the lack of filtering required).
The ultimate goal is to make a bar plot displaying the Orders to Sales ratio broken down by Sector and Manufacturer, similar to the following:
Question: how would I compute the ratio between Orders to Sales (broken down by Manufacturer and Sector) as a means to plot the result?
Thanks in advance!
/////////////////////////////////////////////////// UPDATE/////////////////////////////////////////////
If we amend the data frame to include another column which results in duplicate data in one of the indices, we get an error.
Here is the updated data frame with a new City
column:
test_df = pd.DataFrame({'Manufacturer':['Ford', 'Ford', 'Mercedes', 'BMW', 'Ford', 'Mercedes', 'BMW', 'Ford', 'Mercedes', 'BMW', 'Ford', 'Mercedes', 'BMW', 'Ford', 'Mercedes', 'BMW', 'Ford', 'Mercedes', 'BMW'],
'Metric':['Orders', 'Orders', 'Orders', 'Orders', 'Orders', 'Orders', 'Orders', 'Sales', 'Sales', 'Sales', 'Sales', 'Sales', 'Sales', 'Warranty', 'Warranty', 'Warranty', 'Warranty', 'Warranty', 'Warranty'],
'Sector':['Germany', 'Germany', 'Germany', 'Germany', 'USA', 'USA', 'USA', 'Germany', 'Germany', 'Germany', 'USA', 'USA', 'USA', 'Germany', 'Germany', 'Germany', 'USA', 'USA', 'USA'],
'Value':[45000, 70000, 90000, 65000, 40000, 65000, 63000, 2700, 4400, 3400, 3000, 4700, 5700, 1500, 2000, 2500, 1300, 2000, 2450],
'City': ['Frankfurt', 'Bremen', 'Berlin', 'Hamburg', 'New York', 'Chicago', 'Los Angeles', 'Dresden', 'Munich', 'Cologne', 'Miami', 'Atlanta', 'Phoenix', 'Nuremberg', 'Dusseldorf', 'Leipzig', 'Houston', 'San Diego', 'San Francisco']
})
Running the following:
s1=df.set_index(['Manufacturer','Sector']).query("Metric=='Orders'").Value
s2=df.set_index(['Manufacturer','Sector']).query("Metric=='Sales'").Value
s1.div(s2).unstack().plot(kind='bar')
Results in the following error:
`ValueError: Index contains duplicate entries, cannot reshape`
The reason is because of these duplicates:
Manufacturer Metric Sector Value City
Ford Orders Germany 45000 Frankfurt
Ford Orders Germany 70000 Bremen
I tried to treat this use case with the following from Pandas unstack problems: ValueError: Index contains duplicate entries, cannot reshape:
df.reset_index().pivot_table(values = 'Value', index = ['Manufacturer', 'Metric'], aggfunc='sum')
But, I was unsuccessful and still get the error message about duplicate values.
Does anyone have any ideas about how to solve this?
Thanks in advance!