How to calculate percentage of missing values

Question

Edit: Apologies I actually missed out on an important grouping of data. Thanks for those who already helped.

I have a data set that has missing data. I have filled the missing values with 0. Using Python and Pandas I am trying to get to a metric for each team, the % of Apps they are working on that are complete. My thought was to groupby on ColA, then do counts on Col C, but I cant figure out how to get counts of complete and counts of total to do the calculation. Any ideas are much appreciated.

So I want something that looks like this

  Team A  App1 High 0%
  Team A  App3 Med  100%
  Team B  App2 Med  0%
  And so on.

My df looks like the following

  +--------+-------+-------+----------+
  | Col A  | Col B | Col C |  Col D   |
  +--------+-------+-------+----------+
  | Team A | App1  | High  | 0        |
  | Team A | App1  | High  | 0        |
  | Team A | App3  | Med   | Complete |
  | Team B | App2  | Med   | 0        |
  | Team B | App2  | High  | Complete |
  | Team C | App1  | Low   | Complete |
  +--------+-------+-------+----------+

I honestly don't know how to come up with this answer. The furthest I have gotten is something like a = df.groupby(['Col A', 'Col B', 'Col C']).agg({'Col D' : 'count'}) b = df.groupby(['Col A', 'Col C']).agg({'Col D' : 'count'}) print both a and b to excel then do a manual calculation. Sorry if the exact formulas are off but I think they are close. — DataNoob, Jan 14 '16 at 01:40

Stefan · Accepted Answer · 2016-01-14T01:50:26.477

0

df['count'] = df.groupby(['Col A', 'Col B', 'Col C'])['Col D'].transform(lambda x: (x==0).sum())
df['share'] = df.groupby(['Col A', 'Col B', 'Col C'])['Col D'].transform(lambda x: '{:.2f}%'.format((x==0).sum()/len(x)*100))

yields:

      Col A    Col B    Col C       Col D count    share
0   Team A    App1     High             0     2  100.00%
1   Team A    App1     High             0     2  100.00%
2   Team A    App3     Med      Complete      0    0.00%
3   Team B    App2     Med              0     1  100.00%
4   Team B    App2     High     Complete      0    0.00%
5   Team C    App1     Low      Complete      0    0.00%

or just:

df.groupby(['Col A', 'Col B', 'Col C'])['Col D'].apply(lambda x: '{:.2f}%'.format((x==0).sum()/len(x)*100))

Col A     Col B    Col C  
 Team A    App1     High      100.00%
           App3     Med         0.00%
 Team B    App2     High        0.00%
                    Med       100.00%
 Team C    App1     Low         0.00%

edited Jan 14 '16 at 01:50

answered Jan 14 '16 at 00:51

Stefan

41,759
13
76
81

Hi - I am still a noob to using Python and Pandas. Can you help me understand what is happening here? – DataNoob Jan 14 '16 at 01:32
See updated answer and docs on selecting by `.loc[]`: http://pandas.pydata.org/pandas-docs/stable/indexing.html – Stefan Jan 14 '16 at 01:39
Ok so it now looks like both input and output are actually something different. – Stefan Jan 14 '16 at 01:42
Thank you much appreciated. As a noob, where can I go to learn more about using lambda? – DataNoob Jan 14 '16 at 02:04
1

See http://pandas.pydata.org/pandas-docs/stable/basics.html#row-or-column-wise-function-application or http://stackoverflow.com/questions/16353729/pandas-how-to-use-apply-function-to-multiple-columns or http://www.python-course.eu/lambda.php – Stefan Jan 14 '16 at 02:08

How to calculate percentage of missing values

1 Answers1