
I have a dataframe, df, which looks like this:

|    | rating |  foo1 | foo2 |  foo3 | foo4 |  foo5 | 
|:--:|:------:|:-----:|:----:|:-----:|:----:|:-----:|
|  1 |    2   |   0   |   0  |  0.98 |   0  |  0.7  |
|  2 |    2   |   0   |   0  |   0   |  0.3 | 0.007 |
|  3 |    2   |   0   |   0  |   0   |   0  |   0   |
|  4 |    4   |  0.1  | 0.99 |   0   |   0  | 0.005 |
|  5 |    4   |   0   |   0  |   0   |   0  |  0.01 |
|  6 |    2   |   0   |   0  |  0.66 |   0  |  0.27 |
|  7 |    4   |   0   | 0.92 |  0.32 |   0  |  0.11 |
|  8 |    2   | 0.003 |   0  | 0.073 |   0  | 0.218 |
|  9 |    4   |   0   |   0  |   0   |   0  | 0.004 |
| 10 |    4   |   0   |   0  |   0   |   0  | 0.001 |

except that I have about 13,000 features, and I only care about a certain subset (say `foo1`, `foo2`, `foo3`, `foo4`, and `foo5`).

The shape of my df is: 2000 rows x 13984 columns

What I need to do is count the number of non-zero values per column, grouped by rating, to produce a result like:

|   | foo1 | foo2 | foo3 | foo4 | foo5 |
|:-:|:----:|:----:|:----:|:----:|:----:|
| 2 |   1  |   0  |   3  |   1  |   4  |
| 4 |   1  |   2  |   1  |   0  |   5  |

I know that in SQL, I could do something like:

```sql
SELECT
    rating,
    SUM(CASE WHEN foo1 != 0 THEN 1 ELSE 0 END) AS foo1,
    SUM(CASE WHEN foo2 != 0 THEN 1 ELSE 0 END) AS foo2,
    SUM(CASE WHEN foo3 != 0 THEN 1 ELSE 0 END) AS foo3,
    SUM(CASE WHEN foo4 != 0 THEN 1 ELSE 0 END) AS foo4,
    SUM(CASE WHEN foo5 != 0 THEN 1 ELSE 0 END) AS foo5
FROM df
GROUP BY rating;
```

I have found this Stack Overflow post, but it shows how to perform a similar calculation across all columns, and I only care about a specific five (`foo1`, `foo2`, `foo3`, `foo4`, `foo5`).

How can I write a solution that achieves the desired result using Python and pandas?

artemis

2 Answers


If I understand you correctly, first `set_index` to `rating`, then `groupby`:

```python
import numpy as np
import pandas as pd

np.random.seed(500)

e = {"rating": np.random.choice([2, 4], 100),
     "foo1": np.random.randint(0, 2, 100),
     "foo2": np.random.randint(0, 2, 100),
     "foo3": np.random.randint(0, 2, 100),
     "foo4": np.random.randint(0, 2, 100)}

df = pd.DataFrame(e)
df = df.set_index("rating")
print(df.groupby(df.index).apply(lambda x: x.ne(0).sum()))
```

Output:

```
        foo1  foo2  foo3  foo4
rating
2         21    21    24    19
4         32    26    24    30
```
Henry Yik
  • Thanks Henry. Good suggestion. Unfortunately, I don't know where the other ratings came from (I only had two), and I am not looking for a sum... I am looking for a count of the non-zero items. – artemis Oct 24 '19 at 04:24
  • `sum` on boolean True/False is a count. – Henry Yik Oct 24 '19 at 04:24
  • How do I specify to only perform this function on columns `foo1`, `foo2`, `foo3`, `foo4`, and `foo5`, and not all ~14K columns I have? – artemis Oct 24 '19 at 04:28
  • You can use `cols = df.columns[1:6]` to select the columns you want and apply the lambda on those. Check the answer I have put. – moys Oct 24 '19 at 04:30
  • `df[["foo1", "foo2", ...]].groupby(...)`. – Henry Yik Oct 24 '19 at 04:30
  • @moys that assumes they are in that order; they are not. – artemis Oct 24 '19 at 04:31
  • @JerryM. You can do `cols = df.columns[[1, 2, 3, 5]]`, where the numbers are the positions of the columns that you want. – moys Oct 24 '19 at 04:34
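Following Henry's comment about selecting by name, a minimal sketch (using a small hypothetical frame, since the real one has ~14K columns) where the count is restricted to a named list, so column order does not matter:

```python
import pandas as pd

# Hypothetical toy frame standing in for the question's data.
df = pd.DataFrame({
    "rating": [2, 2, 4, 4],
    "foo1":   [0.0, 0.5, 0.0, 0.1],
    "foo2":   [0.2, 0.0, 0.0, 0.0],
    "other":  [1.0, 1.0, 1.0, 1.0],  # stands in for the unrelated columns
})

# Select the columns of interest by name, not by position.
cols = ["foo1", "foo2"]
result = df.groupby("rating")[cols].apply(lambda x: x.ne(0).sum())
```

`result` is indexed by `rating` and contains only the named columns; the thousands of other features are never touched.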

You can do it this way:

```python
cols = df.columns[1:6]
df.groupby('rating')[cols].apply(lambda x: x.ne(0).sum()).reset_index()
```

Output:

```
   rating  foo1  foo2  foo3  foo4  foo5
0       2     1     0     3     1     4
1       4     1     2     1     0     5
```
moys
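As noted in the comments, positional slicing like `df.columns[1:6]` assumes the five columns are adjacent and in order. A fully vectorized sketch (on a hypothetical toy frame) that selects by name and avoids the per-group lambda entirely:

```python
import pandas as pd

# Hypothetical frame whose feature columns are NOT adjacent.
df = pd.DataFrame({
    "rating": [2, 2, 4],
    "foo1":   [0.0, 0.9, 0.0],
    "noise":  [1.0, 2.0, 3.0],  # an unrelated column in between
    "foo3":   [0.0, 0.0, 0.4],
})

cols = ["foo1", "foo3"]
# Build the boolean non-zero mask once, then sum it per rating group.
out = df[cols].ne(0).groupby(df["rating"]).sum().reset_index()
```

Because the mask is computed in one shot, this avoids calling a Python lambda once per group, which matters less at 2,000 rows but is the more idiomatic pattern.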