import numpy as np
import pandas as pd
results = pd.DataFrame({'Contractor':[1,1,0,0,0,1],
'President':[1,0,0,0,1,1],
'Item 1':[1,1,0,0,1,np.nan],
'Item 2':[1,0,0,1,0,1]})
reference = pd.DataFrame({'Position':['Contractor','President'],
'Item(s)':[(1,), (1,2)]})
longref = pd.DataFrame([('Item {}'.format(item), row['Position'])
for index, row in reference.iterrows()
for item in row['Item(s)']], columns=['Item', 'Position'])
melted = pd.melt(results, id_vars=['Item 1','Item 2'], var_name='Position')
melted = melted.loc[melted['value']==1]
melted = pd.melt(melted, id_vars=['Position'],
value_vars=['Item 1','Item 2'], var_name='Item')
merged = pd.merge(longref, melted, how='left')
grouped = merged.groupby(['Position'])
result = (grouped['value'].sum() / grouped['value'].count())*100
result = result.rename('Overall%').reset_index()
print(result)
yields
Position Overall%
0 Contractor 100.0
1 President 80.0
Explanation: There is an article by Hadley Wickham
(PDF) propounding the advantages
of making data "tidy". The main tenet is that each row should
represent an "observation" and each column represent some factor or variable.
It frequently turns out that the tools you will need to express your calculation
will fall into place quite naturally once the data is tidy.
The difficulty of this problem largely comes from the data not being tidy.
Consider results
:
In [405]: results
Out[405]:
Contractor Item 1 Item 2 President
0 1 1.0 1 1
1 1 1.0 0 0
2 0 0.0 0 0
3 0 0.0 1 0
4 0 1.0 0 1
5 1 NaN 1 1
Instead of having separate columns for Contractor
and President
, it would be nicer to have one column called Position
, since Position
is the variable, and each observation or row can have one value for Position
-- either a Contractor
or a President
.
Similarly, Item 1
and Item 2
should be coalesced into a single column Item
:
In [416]: melted
Out[416]:
Position Item value
0 Contractor Item 1 1.0
1 Contractor Item 1 1.0
2 Contractor Item 1 NaN
3 President Item 1 1.0
4 President Item 1 1.0
5 President Item 1 NaN
6 Contractor Item 2 1.0
7 Contractor Item 2 0.0
8 Contractor Item 2 1.0
9 President Item 2 1.0
10 President Item 2 0.0
11 President Item 2 1.0
melted
contains the same information as results
, but in a tidy format. The value
column contains the values in results[['Item 1', 'Item 2']]
. Each row corresponds to an "observation" where either results['Contractor']
or result['President']` equals 1, since the calculation's logic only requires these values.
Similarly, instead of
In [407]: reference
Out[407]:
Item(s) Position
0 (1,) Contractor
1 (1, 2) President
it would be tidier to have a DataFrame whose columns are Item
and Position
:
In [408]: longref
Out[408]:
Item Position
0 Item 1 Contractor
1 Item 1 President
2 Item 2 President
Once you have the tidy version of your data in the form of melted
and longref
,
calculating the desired result is fairly straight-forward:
merged = pd.merge(longref, melted, how='left')
# Item Position value
# 0 Item 1 Contractor 1.0
# 1 Item 1 Contractor 1.0
# 2 Item 1 Contractor NaN
# 3 Item 1 President 1.0
# 4 Item 1 President 1.0
# 5 Item 1 President NaN
# 6 Item 2 President 1.0
# 7 Item 2 President 0.0
# 8 Item 2 President 1.0
grouped = merged.groupby(['Position'])
result = (grouped['value'].sum() / grouped['value'].count())*100
result = result.rename('Overall%').reset_index()
How to tidy-up reference
to make longref
:
Just iterate through the rows of reference
and for each row iterate through the tuple of items to build the new DataFrame, longref
:
longref = pd.DataFrame([('Item {}'.format(item), row['Position'])
for index, row in reference.iterrows()
for item in row['Item(s)']], columns=['Item', 'Position'])
How to tidy-up results
to make melted
:
It can be done with two calls to pd.melt
. pd.melt
converts "wide" format to "long" format DataFrames. It can coalesce multiple columns into a single column. For example, to coalesce the Contractor and President columns into a single Position column you could use:
melted = pd.melt(results, id_vars=['Item 1','Item 2'], var_name='Position')
# we only care about rows where Contractor or President value was 1. So use .loc to select those rows.
melted = melted.loc[melted['value']==1]
# Item 1 Item 2 Position value
# 0 1.0 1 Contractor 1
# 1 1.0 0 Contractor 1
# 5 NaN 1 Contractor 1
# 6 1.0 1 President 1
# 10 1.0 0 President 1
# 11 NaN 1 President 1
and similarly, to coalesce the Item 1
and Item 2
columns into a single Item
column, use:
melted = pd.melt(melted, id_vars=['Position'],
value_vars=['Item 1','Item 2'], var_name='Item')
# Position Item value
# 0 Contractor Item 1 1.0
# 1 Contractor Item 1 1.0
# 2 Contractor Item 1 NaN
# 3 President Item 1 1.0
# 4 President Item 1 1.0
# 5 President Item 1 NaN
# 6 Contractor Item 2 1.0
# 7 Contractor Item 2 0.0
# 8 Contractor Item 2 1.0
# 9 President Item 2 1.0
# 10 President Item 2 0.0
# 11 President Item 2 1.0