This question shows how to count NAs per group in a dataframe for a particular column C. How do I count NAs for all columns (other than the groupby column)?
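
For reference, a minimal sketch of the single-column version (counting NAs in one column c per group of a), using the same test frame as below:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1, 1, 2, 2],
                   'b': [1, np.nan, 2, np.nan],
                   'c': [1, np.nan, 2, 3]})

# boolean mask of NAs in 'c', summed within each group of 'a'
per_col = df['c'].isna().groupby(df['a']).sum()
print(per_col)
# a
# 1    1
# 2    0
```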

Here is some test code that doesn't work:

#!/usr/bin/env python3

import pandas as pd
import numpy as np

df = pd.DataFrame({'a':[1,1,2,2], 
                   'b':[1,np.nan,2,np.nan],
                   'c':[1,np.nan,2,3]})

# result = df.groupby('a').isna().sum()
# AttributeError: Cannot access callable attribute 'isna' of 'DataFrameGroupBy' objects, try using the 'apply' method

# result = df.groupby('a').transform('isna').sum()
# AttributeError: Cannot access callable attribute 'isna' of 'DataFrameGroupBy' objects, try using the 'apply' method

result = df.isna().groupby('a').sum()
print(result)
# result:
#          b    c
# a
# False  2.0  1.0

result = df.groupby('a').apply(lambda _df: df.isna().sum())
print(result)
# result:
#    a  b  c
# a
# 1  0  2  1
# 2  0  2  1

Desired output:

     b    c
a
1    1    1
2    1    0
dfrankow

8 Answers

It's best to avoid groupby.apply in favor of the basic cythonized functions, which scale far better with many groups and can yield a large performance gain. In this case, first call isnull() on the entire DataFrame, then groupby + sum:

df[df.columns.difference(['a'])].isnull().groupby(df.a).sum().astype(int)
#   b  c
#a      
#1  1  1
#2  1  0

To illustrate the performance gain:

import pandas as pd
import numpy as np

N = 50000
df = pd.DataFrame({'a': [*range(N//2)]*2,
                   'b': np.random.choice([1, np.nan], N),
                   'c': np.random.choice([1, np.nan], N)})

%timeit df[df.columns.difference(['a'])].isnull().groupby(df.a).sum().astype(int)
#7.89 ms ± 187 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit df.groupby('a')[['b', 'c']].apply(lambda x: x.isna().sum())
#9.47 s ± 111 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
ALollz

Using apply with isna and sum, selecting only the relevant columns so the result doesn't include the unnecessary a column:

Note: apply can be slow; prefer one of the vectorized solutions (see the answers of WenYoBen, Anky or ALollz).

df.groupby('a')[['b', 'c']].apply(lambda x: x.isna().sum())

Output

   b  c
a      
1  1  1
2  1  0
Erfan

Your question already contains the answer: you typed df where you meant _df inside the lambda. (Also, selecting groupby columns with ['b', 'c'] is deprecated; pass a list inside the brackets instead.)

result = df.groupby('a')[['b', 'c']].apply(lambda _df: _df.isna().sum())
result
   b  c
a      
1  1  1
2  1  0
pissall

Another way would be to set_index() on a, then groupby on the index level and sum:

df.set_index('a').isna().groupby(level=0).sum()*1

Or:

df.set_index('a').isna().groupby(level=0).sum().astype(int)

Or without groupby, courtesy of @WenYoBen (note: sum(level=0) was deprecated in pandas 1.3 and later removed, so on newer versions prefer the groupby(level=0) form above):

df.set_index('a').isna().sum(level=0).astype(int)

   b  c
a      
1  1  1
2  1  0
anky

I will use count and then rsub with value_counts; I avoided apply here because it usually has poor performance:

df.groupby('a')[['b','c']].count().rsub(df.a.value_counts(dropna=False),axis=0)
Out[78]: 
   b  c
1  1  1
2  1  0
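
The same subtraction idea can be sketched with groupby().size() in place of value_counts: the group size minus the non-NA count gives the NA count per group. A hypothetical variant:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1, 1, 2, 2],
                   'b': [1, np.nan, 2, np.nan],
                   'c': [1, np.nan, 2, 3]})

# size() counts all rows per group; count() counts only non-NA values,
# so size - count is the NA count per group and column
sizes = df.groupby('a').size()
result = df.groupby('a')[['b', 'c']].count().rsub(sizes, axis=0)
print(result)
#    b  c
# a
# 1  1  1
# 2  1  0
```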

Alternative

df.isna().drop(columns='a').astype(int).groupby(df['a']).sum()
Out[83]: 
   b  c
a      
1  1  1
2  1  0
BENY

You need to drop the column after using apply.

df.groupby('a').apply(lambda x: x.isna().sum()).drop(columns='a')

Output:

    b   c
a       
1   1   1
2   1   0
harvpan

Another quick-and-dirty option:

df.set_index('a').isna().astype(int).groupby(level=0).sum()

Output:

    b   c
a       
1   1   1
2   1   0
Quang Hoang

You could write your own aggregation function as follows:

df.groupby('a').agg(lambda x: x.isna().sum())

which results in

     b    c
a          
1  1.0  1.0
2  1.0  0.0
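
The float dtype comes from the lambda aggregation path; to match the integer desired output, a cast can be appended, e.g.:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1, 1, 2, 2],
                   'b': [1, np.nan, 2, np.nan],
                   'c': [1, np.nan, 2, 3]})

# aggregate NA counts per group, then cast back to int
result = df.groupby('a').agg(lambda x: x.isna().sum()).astype(int)
print(result)
#    b  c
# a
# 1  1  1
# 2  1  0
```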
stahamtan