Let's say I've got three project categories A, B and C
that can have three different statuses Started, Finished and canceled
. I'm trying to find the percentage of categories that have for example Status = Finished
for all categories. I think the first step should be to display it in a matrix like below:
Input:
Category Status
A Started
A Started
A Finished
A Finished
B Started
B Canceled
B Canceled
C Started
C Finished
Desired Output:
Started Finished Canceled
A 2 2 0
B 1 0 2
C 1 1 0
But I'm struggling to identify the possible statuses that for example A can have, but does not. In this example Canceled = 0
. I've been trying to do this by subsetting a pandas dataframe by unique observations for Category
, and then joining them to fill non-existent combinations with nan
. But I think this would be very slow for bigger datesets. Also, I'm not quite there yet. Here's the code if anyone would like to try and build on it. But I'm suspecting that there's a much more efficient solution out there somewhere...
My attempt:
import pandas as pd
import numpy as np
#df = pd.read_clipboard(sep='\\s+')
# dft = df.T
frames = {}
n = 0
status = df['Status'].unique()
# Subset and create dataframes
for category in df['Category'].unique():
n = n + 1
newname = 'df_' + str(n)
print(newname)
dfs = df[df['Category']==category]
frames[newname] = dfs
# Join dataframes
df_main = frames['df_1']
frames.pop('df_1')
for key in frames:
df_main = pd.merge(df_main, frames[key], on = 'Category', how = 'outer')
df_main = df_main.set_index(['Category'])
df_main.columns = status
Output df_main:
Started Finished Canceled
Category
A Started NaN NaN
A Started NaN NaN
A Finished NaN NaN
A Finished NaN NaN
B NaN Started NaN
B NaN Canceled NaN
B NaN Canceled NaN
C NaN NaN Started
C NaN NaN Finished