77

I have read a csv file and pivoted it to get to following structure:

pivoted = df.pivot('user_id', 'group', 'value')
lookup = df.drop_duplicates('user_id')[['user_id', 'group']]
lookup.set_index(['user_id'], inplace=True)
result = pivoted.join(lookup)
result = result.fillna(0) 

Section of the result:

             0     1     2    3     4    5   6  7    8   9  10  11  12  13  group
user_id                                                                      
2        33653  2325   916  720   867  187  31  0    6   3  42  56  92  15    l-1
4        18895   414  1116  570  1190   55  92  0  122  23  78   6   4   2    l-2 
16        1383    70    27   17    17    1   0  0    0   0   1   0   0   0    l-2
50         396    72    34    5    18    0   0  0    0   0   0   0   0   0    l-3
51        3915  1170   402  832  2791  316  12  5  118  51  32   9  62  27    l-4

I want to sum across column 0 to column 13 by each row and divide each cell by the sum of that row. I am still getting used to pandas; if I understand correctly, we should try to avoid for loops when doing things like this? In other words, how can I do this in a 'pandas' way?

Archie
  • 2,247
  • 1
  • 18
  • 35
add-semi-colons
  • 18,094
  • 55
  • 145
  • 232

5 Answers5

100

More simply:

result.div(result.sum(axis=1), axis=0)
Henry Ecker
  • 34,399
  • 18
  • 41
  • 57
Souf Ee
  • 1,021
  • 1
  • 7
  • 7
62

Try the following:

In [1]: import pandas as pd

In [2]: df = pd.read_csv("test.csv")

In [3]: df
Out[3]: 
  id  value1  value2  value3
0  A       1       2       3
1  B       4       5       6
2  C       7       8       9

In [4]: df["sum"] = df.sum(axis=1)

In [5]: df
Out[5]: 
  id  value1  value2  value3  sum
0  A       1       2       3    6
1  B       4       5       6   15
2  C       7       8       9   24

In [6]: df_new = df.loc[:,"value1":"value3"].div(df["sum"], axis=0)

In [7]: df_new
Out[7]: 
     value1    value2  value3
0  0.166667  0.333333   0.500
1  0.266667  0.333333   0.400
2  0.291667  0.333333   0.375

Or you can do the following:

In [8]: df.loc[:,"value1":"value3"] = df.loc[:,"value1":"value3"].div(df["sum"], axis=0)

In [9]: df
Out[9]: 
  id    value1    value2  value3  sum
0  A  0.166667  0.333333   0.500    6
1  B  0.266667  0.333333   0.400   15
2  C  0.291667  0.333333   0.375   24

Or just straight up from the beginning:

In [10]: df = pd.read_csv("test.csv")

In [11]: df
Out[11]: 
  id  value1  value2  value3
0  A       1       2       3
1  B       4       5       6
2  C       7       8       9

In [12]: df.loc[:,"value1":"value3"] = df.loc[:,"value1":"value3"].div(df.sum(axis=1), axis=0)

In [13]: df
Out[13]: 
  id    value1    value2  value3
0  A  0.166667  0.333333   0.500
1  B  0.266667  0.333333   0.400
2  C  0.291667  0.333333   0.375

Changing the column value1 and the like to your headers should work similarly.

WGS
  • 13,969
  • 4
  • 48
  • 51
14

easier to work per column:

df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]])
(df.T / df.T.sum()).T

result:

         0         1      2
0  0.166667  0.333333  0.500
1  0.266667  0.333333  0.400
2  0.291667  0.333333  0.375
ihadanny
  • 4,377
  • 7
  • 45
  • 76
7

The following seemed to work fine for me:

In [39]:

cols = ['0','1','2','3','4','5','6','7','8','9','10','11','12','13']
result[cols]  = result[cols].apply(lambda row: row / row.sum(axis=1), axis=1)
result

Out[39]:
                0         1         2         3         4         5         6  \
user_id                                                                         
2        0.864827  0.059749  0.023540  0.018503  0.022280  0.004806  0.000797   
4        0.837285  0.018345  0.049453  0.025258  0.052732  0.002437  0.004077   
16       0.912269  0.046174  0.017810  0.011214  0.011214  0.000660  0.000000   
50       0.754286  0.137143  0.064762  0.009524  0.034286  0.000000  0.000000   
51       0.401868  0.120099  0.041265  0.085403  0.286491  0.032437  0.001232   

                7         8         9        10        11        12        13  \
user_id                                                                         
2        0.000000  0.000154  0.000077  0.001079  0.001439  0.002364  0.000385   
4        0.000000  0.005406  0.001019  0.003456  0.000266  0.000177  0.000089   
16       0.000000  0.000000  0.000000  0.000660  0.000000  0.000000  0.000000   
50       0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   
51       0.000513  0.012113  0.005235  0.003285  0.000924  0.006364  0.002772   

        group  
user_id        
2         l-1  
4         l-2  
16        l-2  
50        l-3  
51        l-4  

OK scratch the above, the following will be much faster:

result[cols]  = result[cols].div(result[cols].sum(axis=1), axis=0)

And just to prove the result is the same:

In [47]:

cols = ['0','1','2','3','4','5','6','7','8','9','10','11','12','13']
np.alltrue(result[cols].div(result[cols].sum(axis=1), axis=0) == result[cols].apply(lambda row: row / row.sum(axis=1), axis=1))
Out[47]:
True

And that it's faster:

In [48]:

cols = ['0','1','2','3','4','5','6','7','8','9','10','11','12','13']
%timeit result[cols].div(result[cols].sum(axis=1), axis=0) 
%timeit result[cols].apply(lambda row: row / row.sum(axis=1), axis=1)
100 loops, best of 3: 2.38 ms per loop
100 loops, best of 3: 4.47 ms per loop
EdChum
  • 376,765
  • 198
  • 813
  • 562
  • sorry i wanted to sum it all the way till column 13. there was a typo. – add-semi-colons Oct 23 '14 at 21:46
  • 1
    @Null-Hypothesis my original answer was incorrect, have corrected it now, I now apply row-wise dividing each item by the row sum, the values seem correct to me now – EdChum Oct 23 '14 at 21:52
  • without doing this there is no other way to mention all columns `cols = ['0','1','2','3','4','5','6','7','8','9','10','11','12','13']`×Comments may only be edited for 5 minutes×Comments may only be edited for 5 minutes×Comments may only be edited for 5 minutes – add-semi-colons Oct 23 '14 at 23:42
  • 1
    you can replace the cols with `result.loc[:,'0':'13']` – EdChum Oct 24 '14 at 07:23
  • Yeah @Nanashi this line did the job `df.loc[:,"value1":"value3"] = df.loc[:,"value1":"value3"].div(df.sum(axis=1),` – add-semi-colons Oct 24 '14 at 13:23
0

result.iloc[:,:-1].div(result.iloc[:,:-1].sum(axis=1), axis=0)

result.iloc[:,:-1] gets all rows and columns except last column

result.iloc[:,:-1].sum(axis=1) sums across a row due to axis=1, default is axis=0 i.e. column

result.div(result, axis=0) axis=0 because default for div is column i.e. axis=1

  • This answer does not seem unique since it contains elements already included in other answers – aaossa Mar 08 '22 at 18:56