0

Let's say I've got three project categories A, B and C that can have three different statuses Started, Finished and canceled. I'm trying to find the percentage of categories that have for example Status = Finished for all categories. I think the first step should be to display it in a matrix like below:

Input:

Category Status
A Started
A Started
A Finished
A Finished
B Started
B Canceled
B Canceled
C Started
C Finished

Desired Output:

    Started Finished Canceled
A   2       2        0
B   1       0        2
C   1       1        0

But I'm struggling to identify the possible statuses that for example A can have, but does not. In this example Canceled = 0. I've been trying to do this by subsetting a pandas dataframe by unique observations for Category, and then joining them to fill non-existent combinations with nan. But I think this would be very slow for bigger datesets. Also, I'm not quite there yet. Here's the code if anyone would like to try and build on it. But I'm suspecting that there's a much more efficient solution out there somewhere...

My attempt:

import pandas as pd
import numpy as np

#df = pd.read_clipboard(sep='\\s+')
# dft = df.T
frames = {}
n = 0

status = df['Status'].unique()

# Subset and create dataframes
for category in df['Category'].unique():
    n = n + 1
    newname = 'df_' + str(n)
    print(newname)
    dfs = df[df['Category']==category]
    frames[newname] = dfs

# Join dataframes
df_main = frames['df_1']
frames.pop('df_1')


for key in frames:
    df_main = pd.merge(df_main, frames[key], on = 'Category', how = 'outer')

df_main = df_main.set_index(['Category'])
df_main.columns = status

Output df_main:

          Started  Finished  Canceled
Category                              
A          Started       NaN       NaN
A          Started       NaN       NaN
A         Finished       NaN       NaN
A         Finished       NaN       NaN
B              NaN   Started       NaN
B              NaN  Canceled       NaN
B              NaN  Canceled       NaN
C              NaN       NaN   Started
C              NaN       NaN  Finished
vestland
  • 55,229
  • 37
  • 187
  • 305
  • 1
    Use `pd.crosstab(df.Category, df.Status)`? – Zero Sep 04 '18 at 13:44
  • I looking for better dupe and add it, now it is better? – jezrael Sep 05 '18 at 07:05
  • And so sorry for not 100% matched dupe before. – jezrael Sep 05 '18 at 07:08
  • Your conlcusion was correct from the beginning. I removed my edit because I for a second messed up the links and though that the link below was the duplicate. My mistake 100%. Sorry. https://stackoverflow.com/questions/23307301/pandas-replacing-column-values-in-dataframe?noredirect=1&lq=1 – vestland Sep 05 '18 at 07:11

0 Answers0