
I have a pandas DataFrame whose first column is "user_id" and the rest of the columns are tags ("Tag_0" to "Tag_122").

I have the data in the following format:

UserId   Tag_0  Tag_1
7867688  0      5
7867688  0      3
7867688  3      0
7867688  3.5    3.5
7867688  4      4
7867688  3.5    0

My aim is to compute Sum(Tag)/Count(NonZero(Tags)) for each user_id.

df.groupby('user_id').sum() gives me sum(tag); however, I am clueless about counting non-zero values.

Is it possible to achieve Sum(Tag)/Count(NonZero(Tags)) in one command?

In MySQL I could achieve this as follows:

select user_id, sum(tag)/count(nullif(tag,0)) from table group by 1

Any help shall be appreciated.

Harsh Singal

5 Answers


My favorite way of getting the number of non-zeros in each column is

df.astype(bool).sum(axis=0)

For the number of non-zeros in each row use

df.astype(bool).sum(axis=1)

(Thanks to Skulas)

If you have NaNs in your df, you should make these zero first; otherwise they will be counted as 1.

df.fillna(0).astype(bool).sum(axis=1)
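Putting this together with the groupby from the question, a minimal sketch of the full Sum(Tag)/Count(NonZero(Tags)) computation on the sample data (assuming the id column is named UserId as shown):

```python
import pandas as pd

df = pd.DataFrame({
    "UserId": [7867688] * 6,
    "Tag_0": [0, 0, 3, 3.5, 4, 3.5],
    "Tag_1": [5, 3, 0, 3.5, 4, 0],
})

grouped = df.groupby("UserId")
# sum of each tag divided by the count of non-zero entries, per user
result = grouped.sum() / grouped.agg(lambda c: c.astype(bool).sum())
print(result)
```

For the sample data both tag columns have four non-zero entries, so this yields 14/4 = 3.5 for Tag_0 and 15.5/4 = 3.875 for Tag_1.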

(Thanks to SirC)

The Unfun Cat

Why not use np.count_nonzero?

  1. To count the number of non-zeros of an entire dataframe, np.count_nonzero(df)
  2. To count the number of non-zeros in each column, np.count_nonzero(df, axis=0)
  3. To count the number of non-zeros in each row, np.count_nonzero(df, axis=1)

It works with dates too.

Sarah
    Thanks for this answer! I ended up with this solution as I think it is very human-readable. I only modified two things: For my understanding of "getting the number of non-zero values for all rows" (your case 2) I needed `axis=1` instead of `axis=0`. And I preferred to get the output as `pandas.Series`, so I used `result = pd.Series(index=df.index, data=np.count_nonzero(df, axis=1))` – marcu1000s Feb 26 '20 at 14:05
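A quick sanity check of the axis semantics on the sample tag columns (a sketch, not part of the original answer):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Tag_0": [0, 0, 3, 3.5, 4, 3.5],
    "Tag_1": [5, 3, 0, 3.5, 4, 0],
})

print(np.count_nonzero(df))          # whole frame
print(np.count_nonzero(df, axis=0))  # per column (collapses rows)
print(np.count_nonzero(df, axis=1))  # per row (collapses columns)
```

axis=0 collapses the rows and gives one count per column; axis=1 collapses the columns and gives one count per row.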

To count non-zero values, just do (column != 0).sum(), where column is the data you want to count. column != 0 returns a boolean array, and True is 1 and False is 0, so summing this gives you the number of elements that match the condition.

So to get your desired result, do

df.groupby('user_id').apply(lambda column: column.sum()/(column != 0).sum())
BrenBarn
  • @BrenBarn What shall be the approach if we have negative values in some of the cells? – Harsh Singal Sep 30 '14 at 10:37
  • @HarshSingal: `column != 0` will find all values that are not zero, regardless of whether they're positive or negative. – BrenBarn Sep 30 '14 at 17:51
  • Sorry for not stating the problem precisely. When I implemented above method the user_id's for which the SUM(Tags) was negative returned -inf in the output while positive SUM(Tags) performed perfectly. I have been unable to figure out why! – Harsh Singal Oct 01 '14 at 09:46
  • @HarshSingal: You could get `inf` if there were no nonzero tags (so that the count of nonzero tags was zero). Your original formulation is not well-defined for that case, so you'll need to think about what you want the result to be. – BrenBarn Oct 01 '14 at 17:58
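To make the inf/NaN caveat concrete, a small sketch with hypothetical data where one user has only zero tags:

```python
import numpy as np
import pandas as pd

# hypothetical data: user 2 has no non-zero tags at all
df = pd.DataFrame({
    "user_id": [1, 1, 2],
    "Tag_0": [2.0, 4.0, 0.0],
})

tags = df.groupby("user_id")[["Tag_0"]]
result = tags.sum() / tags.agg(lambda c: (c != 0).sum())
# a user with no non-zero tags gives 0/0 -> NaN; a nonzero sum over a zero
# count would give +/-inf, so normalise both to NaN
result = result.replace([np.inf, -np.inf], np.nan)
print(result)
```

What the result should be for such users is a modelling decision; mapping the division artifacts to NaN at least makes them easy to spot or fill.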

I know this question is old, but it seems the OP's aim is different from the question title:

My aim is to achieve Sum(Tag)/Count(NonZero(Tags)) for each user_id...


For OP's aim, we could replace 0 with NaN and use groupby + mean (this works because mean skips NaN by default):

import numpy as np

out = df.replace(0, np.nan).groupby('UserId', as_index=False).mean()

Output:

    UserId  Tag_0  Tag_1
0  7867688    3.5  3.875

A simple list comprehension to get the count of non-zero values in each column of df:

[np.count_nonzero(df[x]) for x in df.columns]
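On the sample tag columns, this sketch gives the per-column counts:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Tag_0": [0, 0, 3, 3.5, 4, 3.5],
    "Tag_1": [5, 3, 0, 3.5, 4, 0],
})

counts = [np.count_nonzero(df[x]) for x in df.columns]
print(counts)  # one non-zero count per column
```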
datariel