This question shows how to count NAs per group in a dataframe for a particular column C. How do I count NAs for all columns (other than the groupby column)?
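
For reference, a minimal sketch of the single-column version (counting NAs in one column c per group of a), using the same test frame as below:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1, 1, 2, 2],
                   'b': [1, np.nan, 2, np.nan],
                   'c': [1, np.nan, 2, 3]})

# boolean mask of NAs in 'c', summed within each group of 'a'
per_col = df['c'].isna().groupby(df['a']).sum()
print(per_col)
# a
# 1    1
# 2    0
```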

Here is some test code that doesn't work:

#!/usr/bin/env python3

import pandas as pd
import numpy as np

df = pd.DataFrame({'a':[1,1,2,2], 
                   'b':[1,np.nan,2,np.nan],
                   'c':[1,np.nan,2,3]})

# result = df.groupby('a').isna().sum()
# AttributeError: Cannot access callable attribute 'isna' of 'DataFrameGroupBy' objects, try using the 'apply' method

# result = df.groupby('a').transform('isna').sum()
# AttributeError: Cannot access callable attribute 'isna' of 'DataFrameGroupBy' objects, try using the 'apply' method

result = df.isna().groupby('a').sum()
print(result)
# result:
#          b    c
# a
# False  2.0  1.0

result = df.groupby('a').apply(lambda _df: df.isna().sum())
print(result)
# result:
#    a  b  c
# a
# 1  0  2  1
# 2  0  2  1

Desired output:

     b    c
a
1    1    1
2    1    0
dfrankow

8 Answers

It's best to avoid groupby.apply in favor of the basic cythonized functions, which scale far better with many groups and can yield a large performance gain. In this case, first call isnull() on the entire DataFrame, then groupby + sum:

df[df.columns.difference(['a'])].isnull().groupby(df.a).sum().astype(int)
#   b  c
#a      
#1  1  1
#2  1  0

To illustrate the performance gain:

import pandas as pd
import numpy as np

N = 50000
df = pd.DataFrame({'a': [*range(N//2)]*2,
                   'b': np.random.choice([1, np.nan], N),
                   'c': np.random.choice([1, np.nan], N)})

%timeit df[df.columns.difference(['a'])].isnull().groupby(df.a).sum().astype(int)
#7.89 ms ± 187 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit df.groupby('a')[['b', 'c']].apply(lambda x: x.isna().sum())
#9.47 s ± 111 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
ALollz

Using apply with isna and sum, selecting only the relevant columns so the result doesn't include the unnecessary a column:

Note: apply can be slow; prefer one of the vectorized solutions (see the answers of WenYoBen, Anky or ALollz).

df.groupby('a')[['b', 'c']].apply(lambda x: x.isna().sum())

Output

   b  c
a      
1  1  1
2  1  0
Erfan

Your question already contains the answer: you typed df where you meant _df inside the lambda. (Also, selecting groupby columns with ['b', 'c'] is deprecated; pass a list inside the brackets instead.)

result = df.groupby('a')[['b', 'c']].apply(lambda _df: _df.isna().sum())
result
   b  c
a      
1  1  1
2  1  0
pissall

Another way would be to set_index() on a, then groupby on the index level and sum:

df.set_index('a').isna().groupby(level=0).sum()*1

Or:

df.set_index('a').isna().groupby(level=0).sum().astype(int)

Or without groupby, courtesy of @WenYoBen (note: sum(level=0) was deprecated in pandas 1.3 and later removed, so on newer versions prefer the groupby(level=0) form above):

df.set_index('a').isna().sum(level=0).astype(int)

   b  c
a      
1  1  1
2  1  0
anky

I will use count and then rsub with value_counts; I avoided apply here because it usually has poor performance:

df.groupby('a')[['b','c']].count().rsub(df.a.value_counts(dropna=False),axis=0)
Out[78]: 
   b  c
1  1  1
2  1  0
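
The same subtraction idea can be sketched with groupby().size() in place of value_counts: the group size minus the non-NA count gives the NA count per group. A hypothetical variant:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1, 1, 2, 2],
                   'b': [1, np.nan, 2, np.nan],
                   'c': [1, np.nan, 2, 3]})

# size() counts all rows per group; count() counts only non-NA values,
# so size - count is the NA count per group and column
sizes = df.groupby('a').size()
result = df.groupby('a')[['b', 'c']].count().rsub(sizes, axis=0)
print(result)
#    b  c
# a
# 1  1  1
# 2  1  0
```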

Alternative

df.isna().drop(columns='a').astype(int).groupby(df['a']).sum()
Out[83]: 
   b  c
a      
1  1  1
2  1  0
BENY

You need to drop the column after using apply.

df.groupby('a').apply(lambda x: x.isna().sum()).drop(columns='a')

Output:

    b   c
a       
1   1   1
2   1   0
harvpan

Another quick-and-dirty option:

df.set_index('a').isna().astype(int).groupby(level=0).sum()

Output:

    b   c
a       
1   1   1
2   1   0
Quang Hoang

You could write your own aggregation function as follows:

df.groupby('a').agg(lambda x: x.isna().sum())

which results in

     b    c
a          
1  1.0  1.0
2  1.0  0.0
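
The float dtype comes from the lambda aggregation path; to match the integer desired output, a cast can be appended, e.g.:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1, 1, 2, 2],
                   'b': [1, np.nan, 2, np.nan],
                   'c': [1, np.nan, 2, 3]})

# aggregate NA counts per group, then cast back to int
result = df.groupby('a').agg(lambda x: x.isna().sum()).astype(int)
print(result)
#    b  c
# a
# 1  1  1
# 2  1  0
```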
stahamtan