
I need to make an in-place value replacement based upon a row index value. The replacement value is a computation on a sliced (both row and column) DataFrame.

Setup

In [1]: import pandas as pd                                                                                                  

In [2]: cols = [0, 1, 'A0', 'A1', 'A2', 'A3', 'B0', 'B1', 'B2', 'B3']                                                        

In [3]: data = [['sum', 4531.0010, 0, 0, 0, 2, 0, 0, 0, 7], 
   ...:         ['', 4531.0010, 5, 6, 3, 0, 5, 4, 7, 0], 
   ...:         ['', 4531.0010, 1, 3, 9, 0, 2, 2, 3, 0], 
   ...:         ['sum', 5037.0022, 0, 0, 0, 8, 0, 0, 0, 5], 
   ...:         ['', 5037.0022, 2, 2, 3, 0, 1, 3, 9, 0], 
   ...:         ['', 5037.0022, 5, 4, 7, 0, 5, 6, 3, 0]]                                                                     

In [4]: df = pd.DataFrame(data=data, columns=cols)                                                                           

In [5]: df = df.set_index(list(df.columns[[0, 1]]))                                                                          

In [6]: df                                                                                                                   
Out[6]: 
               A0  A1  A2  A3  B0  B1  B2  B3
0   1                                        
sum 4531.0010   0   0   0   2   0   0   0   7
    4531.0010   5   6   3   0   5   4   7   0
    4531.0010   1   3   9   0   2   2   3   0
sum 5037.0022   0   0   0   8   0   0   0   5
    5037.0022   2   2   3   0   1   3   9   0
    5037.0022   5   4   7   0   5   6   3   0

As you can see, the rows are multi-indexed: level 1 is a number that identifies a subset of the data. Within each data subset there is a "sum" row (level 0) whose balance I would like to "distribute" up (or down) to the zero balances.

The computation is basically this: within the "A" columns, sum all the rows sharing the same level-1 index value to get a denominator. The sum of each individual row in that group is the numerator. The resulting ratio is then used to distribute the "sum" value between the rows.

For the rows where level 1 = 4531.0010 and the "A" columns, the values would be computed as:

(5+6+3) / (5+6+3+1+3+9) * 2 = 1.037 → row 1, column A3
(1+3+9) / (5+6+3+1+3+9) * 2 = 0.963 → row 2, column A3

The resulting df would look like this:

Out[7]: 
               A0  A1  A2     A3  B0  B1  B2     B3
0   1                                              
sum 4531.0010   0   0   0  2.000   0   0   0  7.000
    4531.0010   5   6   3  1.037   5   4   7  4.870
    4531.0010   1   3   9  0.963   2   2   3  2.130
sum 5037.0022   0   0   0  8.000   0   0   0  5.000
    5037.0022   2   2   3  2.435   1   3   9  2.407
    5037.0022   5   4   7  5.565   5   6   3  2.593

The number of rows is not fixed; there could be one or there could be 10. To make the ratio concrete, see the hand computation for one group just below.
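Here is the same computation done by hand for the 4531.0010 group. This is only a minimal sketch against the df built above (the slice labels come from the sample data); it is not part of the solution:

grp = df.xs(4531.0010, level=1)           # all rows of the 4531.0010 group

detail = grp.loc['', ['A0', 'A1', 'A2']]  # the two non-"sum" rows
to_spread = grp.loc['sum', 'A3']          # the balance to distribute (2)

row_sums = detail.sum(axis=1)             # 14 and 13
print(row_sums / row_sums.sum() * to_spread)   # 1.037037 and 0.962963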

What I've tried

I've tried using variations of .pivot_table(), but I can't figure out how to reverse the process using division.

I've also used variations of .sum(), but constraining the df using slices eludes me.

I think I could get this to work with a lot of plain Python, but it seems it should be possible to do more efficiently. Any direction is greatly appreciated.

Bill Armstrong

1 Answer


A working solution, provided the first level of the MultiIndex uniquely identifies each group (the sample data is relabeled sum1/sum2 below for exactly that reason):

import pandas as pd

cols = [0, 1, 'A0', 'A1', 'A2', 'A3', 'B0', 'B1', 'B2', 'B3']

data = [['sum1', 4531.0010, 0, 0, 0, 2, 0, 0, 0, 7], 
        ['sum1', 4531.0010, 5, 6, 3, 0, 5, 4, 7, 0], 
        ['sum1', 4531.0010, 1, 3, 9, 0, 2, 2, 3, 0], 
        ['sum2', 5037.0022, 0, 0, 0, 8, 0, 0, 0, 5], 
        ['sum2', 5037.0022, 2, 2, 3, 0, 1, 3, 9, 0], 
        ['sum2', 5037.0022, 5, 4, 7, 0, 5, 6, 3, 0]]

df = pd.DataFrame(data=data, columns=cols)

df = df.set_index(list(df.columns[[0, 1]]))
print (df)
                A0  A1  A2  A3  B0  B1  B2  B3
0    1                                        
sum1 4531.0010   0   0   0   2   0   0   0   7
     4531.0010   5   6   3   0   5   4   7   0
     4531.0010   1   3   9   0   2   2   3   0
sum2 5037.0022   0   0   0   8   0   0   0   5
     5037.0022   2   2   3   0   1   3   9   0
     5037.0022   5   4   7   0   5   6   3   0

# loop over the distinct first letters of the column names ('A', 'B')
for c in df.columns.str[0].unique():
    # keep only the columns starting with that letter
    df1 = df.filter(like=c)
    # per-row sums of the value columns (everything except the last column)
    s = df1.iloc[:, :-1].sum(axis=1)
    # the last column of the block holds the amount to distribute
    last_col = df1.iloc[:, -1]
    # replace 0s with NaN, then forward-fill the group's "sum" value down
    last_col = last_col.mask(last_col == 0).ffill()
    # divide each row sum by its group total, then scale by last_col
    s = s.div(s.sum(level=0), level=0).mul(last_col)
    # add the distributed amounts back onto the last column
    df[last_col.name] += s
print (df)
                A0  A1  A2        A3  B0  B1  B2        B3
0    1                                                    
sum1 4531.0010   0   0   0  2.000000   0   0   0  7.000000
     4531.0010   5   6   3  1.037037   5   4   7  4.869565
     4531.0010   1   3   9  0.962963   2   2   3  2.130435
sum2 5037.0022   0   0   0  8.000000   0   0   0  5.000000
     5037.0022   2   2   3  2.434783   1   3   9  2.407407
     5037.0022   5   4   7  5.565217   5   6   3  2.592593
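A side note for readers on current pandas: Series.sum(level=0) was deprecated in pandas 1.3 and removed in 2.0. On newer versions the division line inside the loop can be spelled with an explicit groupby; the behaviour is the same:

# equivalent of s.div(s.sum(level=0), level=0) on pandas >= 2.0
s = s.div(s.groupby(level=0).sum(), level=0).mul(last_col)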
jezrael
  • Index 0 is not unique, but if I reset the row index and concatenate indexes 0 & 1 into a new column, maybe that could yield the same result? – Bill Armstrong Sep 04 '19 at 05:20
  • @BillArmstrong - Not sure about the real data, but how are the groups distinguished? Are they repeated like in my changed data sample, or unique like in the question? – jezrael Sep 04 '19 at 05:27
  • 1
    The row indexing is actually 5 deep - so, far more complex. But I was able to simply add a new reference column based on your answer and it works. I think I can get the iterator to use the multiindex depth and apply the .unique() and get there in one less step. Thanks. – Bill Armstrong Sep 04 '19 at 05:41
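For anyone hitting the non-unique case discussed in the comments, here is a hypothetical sketch of the kind of reference column being described; the column name ref and the choice of key are made up for illustration (the real data has a 5-level index, so the details will differ):

df2 = df.reset_index()
# build a key that uniquely labels each group; here level 1 already does
df2['ref'] = df2[1].astype(str)
# make it the first index level, which is what the answer's sum(level=0) needs
df2 = df2.set_index(['ref', 0, 1])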