1

I have the following dataframe in which I am trying to count the total number of the values of "MISSING" per column grouped by A, B and C but I am trying to keep it in the format below , I am not sure if I have to reshape the dataframe or use stack:


               A   B           C        D        E        F  
0     Miami Heat  FL  Basketball       21  MISSING  MISSING            
1     Miami Heat  FL  Basketball       17  MISSING  MISSING            
2     Miami Heat  FL  Basketball  MISSING       12       23             
3  Orlando Magic  FL  Basketball  MISSING        5  MISSING             
4  Orlando Magic  FL  Basketball       10  MISSING  MISSING             
5  Orlando Magic  FL  Basketball        5  MISSING  MISSING             

This is what I would like the result to be:

A        B   Column Name   Missing Count
Miami   Heat FL  D             1
Miami   Heat FL  E             2
Miami   Heat FL  F             2
Orlando Magic FL D             1
Orlando Magic FL E             2
Orlando Magic FL F             3

This code does not work:

df = finalDF.assign(Missing Count = finalDF.eq("MISSING").sum(axis=1))
.groupby(['A','B','C']) ['Missing Count'].count()

celtics33
  • 15
  • 6

2 Answers2

0

Use:

new_df = (df.assign(**df[['D', 'E', 'F']].isnull())
            .pivot_table(index=['A', 'B'], values=['D', 'E', 'F'], aggfunc='sum')
            .reset_index()
            .melt(['A', 'B'], var_name='Column Name', value_name='Missing Count')
            .sort_values('A'))
print(new_df)
               A   B Column Name  Missing Count
0     Miami Heat  FL           D            1.0
2     Miami Heat  FL           E            2.0
4     Miami Heat  FL           F            2.0
1  Orlando Magic  FL           D            1.0
3  Orlando Magic  FL           E            2.0
5  Orlando Magic  FL           F            3.0

if they aren't NaN

df[['D', 'E', 'F']].eq('MISSING')

First step : DataFrame.assign with **:

print(df.assign(**df[['D', 'E', 'F']].isnull()))
               A   B           C      D      E      F
0     Miami_Heat  FL  Basketball  False   True   True
1     Miami_Heat  FL  Basketball  False   True   True
2     Miami_Heat  FL  Basketball   True  False  False
3  Orlando_Magic  FL  Basketball   True  False   True
4  Orlando_Magic  FL  Basketball  False   True   True
5  Orlando_Magic  FL  Basketball  False   True   True

This is a dynamic way to create a DataFrame with True or False in the columns where we want to count missing values. To understand ** you can see : What is the purpose and use of **kwargs? and What does ** (double star/asterisk) and * (star/asterisk) do for parameters?

Second step

Now we can count, one of the simple ways is by DataFrame.pivot_table, we could also have used groupby.

print(df.assign(**df[['D', 'E', 'F']].isnull())
        .pivot_table(index=['A', 'B'], values=['D', 'E', 'F'],aggfunc='sum'))
                    D    E    F
A             B                
Miami_Heat    FL  1.0  2.0  2.0
Orlando_Magic FL  1.0  2.0  3.0

I recommend you see : How pivot a DataFrame

Finally: DataFrame.melt using DataFrame.reset_index
previously not to lose information about the index.

print(df.assign(**df[['D', 'E', 'F']].isnull())
      .pivot_table(index=['A', 'B'], values=['D', 'E', 'F'], aggfunc = 'sum')
      .reset_index()
      .melt(['A', 'B'], var_name='Column Name', value_name='Missing Count'))
               A   B Column Name  Missing Count
0     Miami_Heat  FL           D            1.0
1  Orlando_Magic  FL           D            1.0
2     Miami_Heat  FL           E            2.0
3  Orlando_Magic  FL           E            2.0
4     Miami_Heat  FL           F            2.0
5  Orlando_Magic  FL           F            3.0

maybe here DataFrame.stack + reset_index could be also works well. But we need then .rename_axis(columns = 'Column Name') previously and .rename('Missing Count') before reset_index


DataFrame.sort_values in order to get expected output sorted.

we could also have melted first and then count....

ansev
  • 30,322
  • 5
  • 17
  • 31
  • 1
    Thank you, this works...can you please walk me through the code so I can fully understand what is going on here. – celtics33 Feb 21 '20 at 11:37
0
df = pd.melt(df, id_vars=['A','B'])
df.rename(columns={'variable':'Column Name'}, inplace=True)
df[df['value']=='MISSING'].groupby(['A','B', 'Column Name']).count().reset_index()
Chris
  • 15,819
  • 3
  • 24
  • 37
  • 1
    Can you explain your answer? – AMC Feb 20 '20 at 18:50
  • While this code may provide a solution to OP's problem, it is highly recommended that you provide additional context regarding why and/or how this code answers the question. Code only answers typically become useless in the long-run because future viewers experiencing similar problems cannot understand the reasoning behind the solution. – E. Zeytinci Feb 20 '20 at 20:33