Pandas sum across columns and divide each cell from that value

Question

I have read a csv file and pivoted it to get to following structure:

pivoted = df.pivot('user_id', 'group', 'value')
lookup = df.drop_duplicates('user_id')[['user_id', 'group']]
lookup.set_index(['user_id'], inplace=True)
result = pivoted.join(lookup)
result = result.fillna(0)

Section of the result:

             0     1     2    3     4    5   6  7    8   9  10  11  12  13  group
user_id                                                                      
2        33653  2325   916  720   867  187  31  0    6   3  42  56  92  15    l-1
4        18895   414  1116  570  1190   55  92  0  122  23  78   6   4   2    l-2 
16        1383    70    27   17    17    1   0  0    0   0   1   0   0   0    l-2
50         396    72    34    5    18    0   0  0    0   0   0   0   0   0    l-3
51        3915  1170   402  832  2791  316  12  5  118  51  32   9  62  27    l-4

I want to sum across column 0 to column 13 by each row and divide each cell by the sum of that row. I am still getting used to pandas; if I understand correctly, we should try to avoid for loops when doing things like this? In other words, how can I do this in a 'pandas' way?

I think so, going more sudo: `0/sum(0..13),1/sum(0..13)` also `column 0 `is the name of the column not necessary its index.` — add-semi-colons, Oct 23 '14 at 21:44

score 100 · Answer 1 · edited Jan 05 '22 at 22:21

100

More simply:

result.div(result.sum(axis=1), axis=0)

edited Jan 05 '22 at 22:21

Henry Ecker

34,399
18
41
57

answered Jul 17 '15 at 16:58

Souf Ee

1,021
1
7
7

score 62 · Accepted Answer · answered Oct 23 '14 at 22:13

Try the following:

In [1]: import pandas as pd

In [2]: df = pd.read_csv("test.csv")

In [3]: df
Out[3]: 
  id  value1  value2  value3
0  A       1       2       3
1  B       4       5       6
2  C       7       8       9

In [4]: df["sum"] = df.sum(axis=1)

In [5]: df
Out[5]: 
  id  value1  value2  value3  sum
0  A       1       2       3    6
1  B       4       5       6   15
2  C       7       8       9   24

In [6]: df_new = df.loc[:,"value1":"value3"].div(df["sum"], axis=0)

In [7]: df_new
Out[7]: 
     value1    value2  value3
0  0.166667  0.333333   0.500
1  0.266667  0.333333   0.400
2  0.291667  0.333333   0.375

Or you can do the following:

In [8]: df.loc[:,"value1":"value3"] = df.loc[:,"value1":"value3"].div(df["sum"], axis=0)

In [9]: df
Out[9]: 
  id    value1    value2  value3  sum
0  A  0.166667  0.333333   0.500    6
1  B  0.266667  0.333333   0.400   15
2  C  0.291667  0.333333   0.375   24

Or just straight up from the beginning:

In [10]: df = pd.read_csv("test.csv")

In [11]: df
Out[11]: 
  id  value1  value2  value3
0  A       1       2       3
1  B       4       5       6
2  C       7       8       9

In [12]: df.loc[:,"value1":"value3"] = df.loc[:,"value1":"value3"].div(df.sum(axis=1), axis=0)

In [13]: df
Out[13]: 
  id    value1    value2  value3
0  A  0.166667  0.333333   0.500
1  B  0.266667  0.333333   0.400
2  C  0.291667  0.333333   0.375

Changing the column value1 and the like to your headers should work similarly.

score 14 · Answer 3 · answered Dec 08 '16 at 12:54

14

easier to work per column:

df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]])
(df.T / df.T.sum()).T

result:

         0         1      2
0  0.166667  0.333333  0.500
1  0.266667  0.333333  0.400
2  0.291667  0.333333  0.375

answered Dec 08 '16 at 12:54

ihadanny

4,377
7
45
76

Much better solution, very low memory consumption :) – MPA Feb 23 '20 at 12:28

EdChum · Answer 4 · 2014-10-23T21:58:13.177

The following seemed to work fine for me:

In [39]:

cols = ['0','1','2','3','4','5','6','7','8','9','10','11','12','13']
result[cols]  = result[cols].apply(lambda row: row / row.sum(axis=1), axis=1)
result

Out[39]:
                0         1         2         3         4         5         6  \
user_id                                                                         
2        0.864827  0.059749  0.023540  0.018503  0.022280  0.004806  0.000797   
4        0.837285  0.018345  0.049453  0.025258  0.052732  0.002437  0.004077   
16       0.912269  0.046174  0.017810  0.011214  0.011214  0.000660  0.000000   
50       0.754286  0.137143  0.064762  0.009524  0.034286  0.000000  0.000000   
51       0.401868  0.120099  0.041265  0.085403  0.286491  0.032437  0.001232   

                7         8         9        10        11        12        13  \
user_id                                                                         
2        0.000000  0.000154  0.000077  0.001079  0.001439  0.002364  0.000385   
4        0.000000  0.005406  0.001019  0.003456  0.000266  0.000177  0.000089   
16       0.000000  0.000000  0.000000  0.000660  0.000000  0.000000  0.000000   
50       0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   
51       0.000513  0.012113  0.005235  0.003285  0.000924  0.006364  0.002772   

        group  
user_id        
2         l-1  
4         l-2  
16        l-2  
50        l-3  
51        l-4

OK scratch the above, the following will be much faster:

result[cols]  = result[cols].div(result[cols].sum(axis=1), axis=0)

And just to prove the result is the same:

In [47]:

cols = ['0','1','2','3','4','5','6','7','8','9','10','11','12','13']
np.alltrue(result[cols].div(result[cols].sum(axis=1), axis=0) == result[cols].apply(lambda row: row / row.sum(axis=1), axis=1))
Out[47]:
True

And that it's faster:

In [48]:

cols = ['0','1','2','3','4','5','6','7','8','9','10','11','12','13']
%timeit result[cols].div(result[cols].sum(axis=1), axis=0) 
%timeit result[cols].apply(lambda row: row / row.sum(axis=1), axis=1)
100 loops, best of 3: 2.38 ms per loop
100 loops, best of 3: 4.47 ms per loop

sorry i wanted to sum it all the way till column 13. there was a typo. — add-semi-colons, Oct 23 '14 at 21:46
@Null-Hypothesis my original answer was incorrect, have corrected it now, I now apply row-wise dividing each item by the row sum, the values seem correct to me now — EdChum, Oct 23 '14 at 21:52
without doing this there is no other way to mention all columns `cols = ['0','1','2','3','4','5','6','7','8','9','10','11','12','13']`×Comments may only be edited for 5 minutes×Comments may only be edited for 5 minutes×Comments may only be edited for 5 minutes — add-semi-colons, Oct 23 '14 at 23:42
Yeah @Nanashi this line did the job `df.loc[:,"value1":"value3"] = df.loc[:,"value1":"value3"].div(df.sum(axis=1),` — add-semi-colons, Oct 24 '14 at 13:23

score 0 · Answer 5 · answered Mar 04 '22 at 16:58

0

result.iloc[:,:-1].div(result.iloc[:,:-1].sum(axis=1), axis=0)

result.iloc[:,:-1] gets all rows and columns except last column

result.iloc[:,:-1].sum(axis=1) sums across a row due to axis=1, default is axis=0 i.e. column

result.div(result, axis=0) axis=0 because default for div is column i.e. axis=1

answered Mar 04 '22 at 16:58

Vilas Yuvraj Deshmukh

1

This answer does not seem unique since it contains elements already included in other answers – aaossa Mar 08 '22 at 18:56

Pandas sum across columns and divide each cell from that value

5 Answers5

Linked

Related