
I have a DataFrame of the form:

[screenshot of the input DataFrame: a Group column plus numeric columns 1-9]

What I would like to achieve is a DataFrame that has one row per unique group and, for each of columns 1-9, a count of the non-zero values. It would look like the below.

[screenshot of the desired output: one row per group, with the count of non-zero values in each of columns 1-9]

I'm pretty much at a loss on this. I saw the question here, but this doesn't address the grouping aspect. How can I achieve this using pandas?
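
Since the screenshots are gone, here is a rough, made-up sketch of the shape I mean (only a few of the 1-9 columns shown; values are illustrative, not my actual data):

import pandas as pd

# Input: one 'Group' column plus numeric columns (1-9 in the real data)
df = pd.DataFrame({'Group': ['A', 'A', 'B', 'B'],
                   '1': [0, 3, 0, 5],
                   '2': [1, 0, 1, 1],
                   '3': [4, 4, 0, 0]})

# Desired output: one row per group, each cell = number of non-zero
# values in that column for that group:
#
#   Group  1  2  3
# 0     A  1  1  2
# 1     B  1  2  0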


2 Answers


Assuming you have the following DF:

In [82]: df
Out[82]:
    Group  1  2  3  4  5
0  Group1  0  1  4  0  1
1  Group1  3  0  4  1  5
2  Group2  0  1  4  3  6
3  Group2  5  1  4  0  7
4  Group3  0  0  4  7  8
5  Group3  7  1  4  7  9

Solution:

In [83]: df.set_index('Group').ne(0).groupby(level=0).sum().reset_index()
Out[83]:
    Group    1    2    3    4    5
0  Group1  1.0  1.0  2.0  1.0  2.0
1  Group2  1.0  2.0  2.0  1.0  2.0
2  Group3  1.0  1.0  2.0  2.0  2.0

as integers:

In [84]: df.set_index('Group').ne(0).groupby(level=0).sum().astype(int).reset_index()
Out[84]:
    Group  1  2  3  4  5
0  Group1  1  1  2  1  2
1  Group2  1  2  2  1  2
2  Group3  1  1  2  2  2
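
A rough breakdown of what each step of that chain does (same df as above; the boolean frame below is written out by hand, not pasted from a session):

mask = df.set_index('Group').ne(0)    # boolean DataFrame: True wherever the value != 0
#             1      2     3      4     5
# Group
# Group1  False   True  True  False  True
# Group1   True  False  True   True  True
# Group2  False   True  True   True  True
# Group2   True   True  True  False  True
# Group3  False  False  True   True  True
# Group3   True   True  True   True  True

counts = mask.groupby(level=0).sum()       # True counts as 1, so the per-group sum is the non-zero count
result = counts.astype(int).reset_index()  # force int dtype and turn 'Group' back into a column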

UPDATE:

In [92]: df
Out[92]:
    Group  1  2  3  4  5
0  Group1  0  1  4  0  0
1  Group1  3  0  4  1  0
2  Group2  0  1  4  3  0
3  Group2  5  1  4  0  0
4  Group3  0  0  4  7  0
5  Group3  7  1  4  7  0

In [93]: df.set_index('Group').ne(0).groupby(level=0).sum().astype(int).reset_index()
Out[93]:
    Group  1  2  3  4  5
0  Group1  1  1  2  1  0
1  Group2  1  2  2  1  0
2  Group3  1  1  2  2  0

Timing:

In [37]: %timeit df.set_index('Group').ne(0).groupby(level=0).sum().astype(int).reset_index()
100 loops, best of 3: 6.24 ms per loop

In [38]: %timeit df.pivot_table(index='Group', aggfunc=np.count_nonzero)
100 loops, best of 3: 19.2 ms per loop

Let's test it against a bigger DF (60,000 rows):

In [39]: df = pd.concat([df] * 10**4, ignore_index=True)

In [40]: df.shape
Out[40]: (60000, 6)

In [42]: %timeit df.set_index('Group').ne(0).groupby(level=0).sum().astype(int).reset_index()
10 loops, best of 3: 22.4 ms per loop

In [43]: %timeit df.pivot_table(index='Group', aggfunc=np.count_nonzero)
10 loops, best of 3: 43 ms per loop

Conclusion:

IMO the high-level pivot_table function is slower due to additional overhead, but the difference is not that huge for bigger data sets...
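
If you want to reproduce the comparison yourself, here is a minimal self-contained sketch for an IPython session (column labels written as strings purely for the sketch; absolute timings will differ by machine and pandas version):

import numpy as np
import pandas as pd

df = pd.DataFrame({'Group': ['Group1', 'Group1', 'Group2', 'Group2', 'Group3', 'Group3'],
                   '1': [0, 3, 0, 5, 0, 7],
                   '2': [1, 0, 1, 1, 0, 1],
                   '3': [4, 4, 4, 4, 4, 4],
                   '4': [0, 1, 3, 0, 7, 7],
                   '5': [1, 5, 6, 7, 8, 9]})
df = pd.concat([df] * 10**4, ignore_index=True)   # ~60,000 rows, as in the test above

%timeit df.set_index('Group').ne(0).groupby(level=0).sum().astype(int).reset_index()
%timeit df.pivot_table(index='Group', aggfunc=np.count_nonzero)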

  • An important aspect of this is that each row will ultimately end in zeros, and I would like to preserve all columns. Is that possible to achieve? – Kyle Mar 28 '17 at 20:43
  • @Kyle, hmm... I can't reproduce it - see UPDATE section – MaxU - stand with Ukraine Mar 28 '17 at 21:02
  • I was not applying this correctly. This works as expected. – Kyle Mar 28 '17 at 21:02
  • @MaxU, would you be able to explain why this method is faster than the answer below? Your method averages 1.92 ms on my test DataFrame and the answer below averages 9.54 ms. – Kyle Mar 29 '17 at 16:03

Using df from @MaxU's solution:

df.pivot_table(index='Group', aggfunc=np.count_nonzero)


        1   2   3   4   5
Group                   
Group1  1   1   2   1   2
Group2  1   2   2   1   2
Group3  1   1   2   2   2
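
Note that this relies on numpy being imported as np. The same aggregation can also be spelled with a plain groupby (a variant shown for comparison, not part of the original answer):

import numpy as np

# df as defined in @MaxU's answer above.
# count_nonzero is applied column-wise within each group, which is
# exactly what pivot_table does here when only an index is given.
df.groupby('Group').agg(np.count_nonzero)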