
I have a DataFrame of the form:

[screenshot of the input DataFrame: a Group column plus numeric columns 1-9]

What I would like to achieve is a DataFrame that has one row per unique group and, for each of columns 1-9, a count of the non-zero values. It would look like the below.

[screenshot of the desired output: one row per group, with the count of non-zero values in each of columns 1-9]

I'm pretty much at a loss on this. I saw the question here, but this doesn't address the grouping aspect. How can I achieve this using pandas?
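
Since the screenshots are gone, here is a rough, made-up sketch of the shape I mean (only a few of the 1-9 columns shown; values are illustrative, not my actual data):

import pandas as pd

# Input: one 'Group' column plus numeric columns (1-9 in the real data)
df = pd.DataFrame({'Group': ['A', 'A', 'B', 'B'],
                   '1': [0, 3, 0, 5],
                   '2': [1, 0, 1, 1],
                   '3': [4, 4, 0, 0]})

# Desired output: one row per group, each cell = number of non-zero
# values in that column for that group:
#
#   Group  1  2  3
# 0     A  1  1  2
# 1     B  1  2  0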


2 Answers


Assuming you have the following DF:

In [82]: df
Out[82]:
    Group  1  2  3  4  5
0  Group1  0  1  4  0  1
1  Group1  3  0  4  1  5
2  Group2  0  1  4  3  6
3  Group2  5  1  4  0  7
4  Group3  0  0  4  7  8
5  Group3  7  1  4  7  9

Solution:

In [83]: df.set_index('Group').ne(0).groupby(level=0).sum().reset_index()
Out[83]:
    Group    1    2    3    4    5
0  Group1  1.0  1.0  2.0  1.0  2.0
1  Group2  1.0  2.0  2.0  1.0  2.0
2  Group3  1.0  1.0  2.0  2.0  2.0

as integers:

In [84]: df.set_index('Group').ne(0).groupby(level=0).sum().astype(int).reset_index()
Out[84]:
    Group  1  2  3  4  5
0  Group1  1  1  2  1  2
1  Group2  1  2  2  1  2
2  Group3  1  1  2  2  2
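
A rough breakdown of what each step of that chain does (same df as above; the boolean frame below is written out by hand, not pasted from a session):

mask = df.set_index('Group').ne(0)    # boolean DataFrame: True wherever the value != 0
#             1      2     3      4     5
# Group
# Group1  False   True  True  False  True
# Group1   True  False  True   True  True
# Group2  False   True  True   True  True
# Group2   True   True  True  False  True
# Group3  False  False  True   True  True
# Group3   True   True  True   True  True

counts = mask.groupby(level=0).sum()       # True counts as 1, so the per-group sum is the non-zero count
result = counts.astype(int).reset_index()  # force int dtype and turn 'Group' back into a column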

UPDATE:

In [92]: df
Out[92]:
    Group  1  2  3  4  5
0  Group1  0  1  4  0  0
1  Group1  3  0  4  1  0
2  Group2  0  1  4  3  0
3  Group2  5  1  4  0  0
4  Group3  0  0  4  7  0
5  Group3  7  1  4  7  0

In [93]: df.set_index('Group').ne(0).groupby(level=0).sum().astype(int).reset_index()
Out[93]:
    Group  1  2  3  4  5
0  Group1  1  1  2  1  0
1  Group2  1  2  2  1  0
2  Group3  1  1  2  2  0

Timing:

In [37]: %timeit df.set_index('Group').ne(0).groupby(level=0).sum().astype(int).reset_index()
100 loops, best of 3: 6.24 ms per loop

In [38]: %timeit df.pivot_table(index='Group', aggfunc=np.count_nonzero)
100 loops, best of 3: 19.2 ms per loop

Let's test it against a bigger DF (60,000 rows):

In [39]: df = pd.concat([df] * 10**4, ignore_index=True)

In [40]: df.shape
Out[40]: (60000, 6)

In [42]: %timeit df.set_index('Group').ne(0).groupby(level=0).sum().astype(int).reset_index()
10 loops, best of 3: 22.4 ms per loop

In [43]: %timeit df.pivot_table(index='Group', aggfunc=np.count_nonzero)
10 loops, best of 3: 43 ms per loop

Conclusion:

IMO the high-level pivot_table function is slower due to additional overhead, but the difference is not that huge for bigger data sets...
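
If you want to reproduce the comparison yourself, here is a minimal self-contained sketch for an IPython session (column labels written as strings purely for the sketch; absolute timings will differ by machine and pandas version):

import numpy as np
import pandas as pd

df = pd.DataFrame({'Group': ['Group1', 'Group1', 'Group2', 'Group2', 'Group3', 'Group3'],
                   '1': [0, 3, 0, 5, 0, 7],
                   '2': [1, 0, 1, 1, 0, 1],
                   '3': [4, 4, 4, 4, 4, 4],
                   '4': [0, 1, 3, 0, 7, 7],
                   '5': [1, 5, 6, 7, 8, 9]})
df = pd.concat([df] * 10**4, ignore_index=True)   # ~60,000 rows, as in the test above

%timeit df.set_index('Group').ne(0).groupby(level=0).sum().astype(int).reset_index()
%timeit df.pivot_table(index='Group', aggfunc=np.count_nonzero)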

  • An important aspect of this is that each row will ultimately end in zeros, and I would like to preserve all columns. Is that possible to achieve? – Kyle Mar 28 '17 at 20:43
  • @Kyle, hmm... I can't reproduce it - see UPDATE section – MaxU - stand with Ukraine Mar 28 '17 at 21:02
  • I was not applying this correctly. This works as expected. – Kyle Mar 28 '17 at 21:02
  • @MaxU, would you be able to explain why this method is faster than the answer below? Your method averages 1.92 ms on my test DataFrame and the answer below averages 9.54 ms. – Kyle Mar 29 '17 at 16:03

Using df from @MaxU's solution:

df.pivot_table(index='Group', aggfunc=np.count_nonzero)


        1   2   3   4   5
Group                   
Group1  1   1   2   1   2
Group2  1   2   2   1   2
Group3  1   1   2   2   2
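
Note that this relies on numpy being imported as np. The same aggregation can also be spelled with a plain groupby (a variant shown for comparison, not part of the original answer):

import numpy as np

# df as defined in @MaxU's answer above.
# count_nonzero is applied column-wise within each group, which is
# exactly what pivot_table does here when only an index is given.
df.groupby('Group').agg(np.count_nonzero)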