
I have a pandas DataFrame whose first column is "user_id" and the rest of the columns are tags ("Tag_0" to "Tag_122").

I have the data in the following format:

UserId   Tag_0  Tag_1
7867688  0      5
7867688  0      3
7867688  3      0
7867688  3.5    3.5
7867688  4      4
7867688  3.5    0

My aim is to compute Sum(Tag)/Count(NonZero(Tags)) for each user_id.

df.groupby('user_id').sum() gives me sum(tag); however, I am clueless about counting non-zero values.

Is it possible to achieve Sum(Tag)/Count(NonZero(Tags)) in one command?

In MySQL I could achieve this as follows:

select user_id, sum(tag)/count(nullif(tag,0)) from table group by 1

Any help shall be appreciated.

Harsh Singal

5 Answers


My favorite way of getting the number of non-zeros in each column is

df.astype(bool).sum(axis=0)

For the number of non-zeros in each row use

df.astype(bool).sum(axis=1)

(Thanks to Skulas)

If you have NaNs in your df, you should make these zero first; otherwise they will be counted as 1.

df.fillna(0).astype(bool).sum(axis=1)
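Putting this together with the groupby from the question, a minimal sketch of the full Sum(Tag)/Count(NonZero(Tags)) computation on the sample data (assuming the id column is named UserId as shown):

```python
import pandas as pd

df = pd.DataFrame({
    "UserId": [7867688] * 6,
    "Tag_0": [0, 0, 3, 3.5, 4, 3.5],
    "Tag_1": [5, 3, 0, 3.5, 4, 0],
})

grouped = df.groupby("UserId")
# sum of each tag divided by the count of non-zero entries, per user
result = grouped.sum() / grouped.agg(lambda c: c.astype(bool).sum())
print(result)
```

For the sample data both tag columns have four non-zero entries, so this yields 14/4 = 3.5 for Tag_0 and 15.5/4 = 3.875 for Tag_1.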

(Thanks to SirC)

The Unfun Cat

Why not use np.count_nonzero?

  1. To count the number of non-zeros of an entire dataframe, np.count_nonzero(df)
  2. To count the number of non-zeros in each column, np.count_nonzero(df, axis=0)
  3. To count the number of non-zeros in each row, np.count_nonzero(df, axis=1)

It works with dates too.

Sarah
    Thanks for this answer! I ended up with this solution as I think it is very human-readable. I only modified two things: For my understanding of "getting the number of non-zero values for all rows" (your case 2) I needed `axis=1` instead of `axis=0`. And I preferred to get the output as `pandas.Series`, so I used `result = pd.Series(index=df.index, data=np.count_nonzero(df, axis=1))` – marcu1000s Feb 26 '20 at 14:05
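A quick sanity check of the axis semantics on the sample tag columns (a sketch, not part of the original answer):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Tag_0": [0, 0, 3, 3.5, 4, 3.5],
    "Tag_1": [5, 3, 0, 3.5, 4, 0],
})

print(np.count_nonzero(df))          # whole frame
print(np.count_nonzero(df, axis=0))  # per column (collapses rows)
print(np.count_nonzero(df, axis=1))  # per row (collapses columns)
```

axis=0 collapses the rows and gives one count per column; axis=1 collapses the columns and gives one count per row.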

To count non-zero values, just do (column != 0).sum(), where column is the data you want to count. column != 0 returns a boolean array, and True is 1 and False is 0, so summing this gives you the number of elements that match the condition.

So to get your desired result, do

df.groupby('user_id').apply(lambda column: column.sum()/(column != 0).sum())
BrenBarn
  • @BrenBarn What shall be the approach if we have negative values in some of the cells? – Harsh Singal Sep 30 '14 at 10:37
  • @HarshSingal: `column != 0` will find all values that are not zero, regardless of whether they're positive or negative. – BrenBarn Sep 30 '14 at 17:51
  • Sorry for not stating the problem precisely. When I implemented above method the user_id's for which the SUM(Tags) was negative returned -inf in the output while positive SUM(Tags) performed perfectly. I have been unable to figure out why! – Harsh Singal Oct 01 '14 at 09:46
  • @HarshSingal: You could get `inf` if there were no nonzero tags (so that the count of nonzero tags was zero). Your original formulation is not well-defined for that case, so you'll need to think about what you want the result to be. – BrenBarn Oct 01 '14 at 17:58
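To make the inf/NaN caveat concrete, a small sketch with hypothetical data where one user has only zero tags:

```python
import numpy as np
import pandas as pd

# hypothetical data: user 2 has no non-zero tags at all
df = pd.DataFrame({
    "user_id": [1, 1, 2],
    "Tag_0": [2.0, 4.0, 0.0],
})

tags = df.groupby("user_id")[["Tag_0"]]
result = tags.sum() / tags.agg(lambda c: (c != 0).sum())
# a user with no non-zero tags gives 0/0 -> NaN; a nonzero sum over a zero
# count would give +/-inf, so normalise both to NaN
result = result.replace([np.inf, -np.inf], np.nan)
print(result)
```

What the result should be for such users is a modelling decision; mapping the division artifacts to NaN at least makes them easy to spot or fill.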

I know this question is old, but it seems the OP's aim is different from the question title:

My aim is to achieve Sum(Tag)/Count(NonZero(Tags)) for each user_id...


For OP's aim, we could replace 0 with NaN and use groupby + mean (this works because mean skips NaN by default):

import numpy as np

out = df.replace(0, np.nan).groupby('UserId', as_index=False).mean()

Output:

    UserId  Tag_0  Tag_1
0  7867688    3.5  3.875

A simple list comprehension to get the count of non-zero values in each column of df:

[np.count_nonzero(df[x]) for x in df.columns]
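On the sample tag columns, this sketch gives the per-column counts:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Tag_0": [0, 0, 3, 3.5, 4, 3.5],
    "Tag_1": [5, 3, 0, 3.5, 4, 0],
})

counts = [np.count_nonzero(df[x]) for x in df.columns]
print(counts)  # one non-zero count per column
```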
datariel