4

I would like to make simple calculations on a rolling basis, but have heavy performance issues when I try to solve this with a nested for-loop. I need to perform this kind of operations on very large data, but have to use standard Python (incl. Pandas). The values are floats and can be negative, zero or positive.

I have a pd.DataFrame (df1) which contains (structured by some dimensions, lets call them key1 and key2) a start column, a end column and some operations-columns in between, which are supposed to be used to calculate the end column based on the start column.

Basically, the simple logic is: start + plus - minus = end, where the end value of each row is the start value of the next row.

This would need to be done by the two keys, i.e. for AX, AY and BX seperately.

df2 shows the desired result, but I don't know how to get there in an efficient way without blowing up my memory if this task is done on much larger tables.

import pandas as pd 
import numpy as np

df1 = pd.DataFrame(np.array([["A", "X", 3,6,4,0], ["A", "X", 0,2,10,0], ["A", "X", 0,9,3,0], ["A", "Y", 8,3,1,0], ["A", "Y", 0,2,3,0], ["B", "X", 4,4,2,0], ["B", "X", 0,1,0,0]]),
                   columns=['key1', 'key2', 'start', 'plus', 'minus', 'end'])

>>> df1
  key1 key2 start plus minus end
0    A    X     3    6     4   0
1    A    X     0    2    10   0
2    A    X     0    9     3   0
3    A    Y     8    3     1   0
4    A    Y     0    2     3   0
5    B    X     4    4     2   0
6    B    X     0    1     0   0
    

df2 = pd.DataFrame(np.array([["A", "X", 3,6,4,5], ["A", "X", 5,2,10,-3], ["A", "X", -3,9,3,3], ["A", "Y", 8,3,1,10], ["A", "Y", 10,2,3,9], ["B", "X", 4,4,2,2], ["B", "X", 2,1,0,3]]),
                   columns=['key1', 'key2', 'start', 'plus', 'minus', 'end'])

>>> df2
  key1 key2 start plus minus end
0    A    X     3    6     4   5
1    A    X     5    2    10  -3
2    A    X    -3    9     3   3
3    A    Y     8    3     1  10
4    A    Y    10    2     3   9
5    B    X     4    4     2   2
6    B    X     2    1     0   3
constiii
  • 638
  • 3
  • 19

4 Answers4

3

Code (+ some maths)

keys = ['key1', 'key2']

cs = df1.groupby(keys)[['plus', 'minus']].cumsum()
start = df1.groupby(keys)['start'].transform('first')

df1['end'] = start + cs['plus'] - cs['minus']

Result

  key1 key2  start  plus  minus  end
0    A    X      3     6      4    5
1    A    X      0     2      1    6
2    A    X      0     5      7    4
3    A    Y      8     3      1   10
4    A    Y      0     2      3    9
5    B    X      4     4      2    6
6    B    X      0     1      0    7

Explanation

Lets calculate the values for each row using the formula

end1 = `start1 + plus1 - minus1`
end2 = `end1 + plus2 - minus2` 
     = `start1 + (plus1 + plus2) - (minus1 + minus2)`
end3 = `end2 + plus3 - minus3`
     = `start1 + (plus1 + plus2 + plus3) - (minus1 + minus2 + minus3)`
....

If you observe the formula, there is a visible pattern which is that the end value is equal to the start value for the group plus the cumulative sum of the 'plus' rows minus the cumulative sum of the 'minus' rows

Shubham Sharma
  • 68,127
  • 6
  • 24
  • 53
  • It is working, but not if "end" becomes negative. I adjusted the example code for a negative case. Any idea how to adjust your solution to that? – constiii Apr 06 '23 at 16:41
  • Sorry, my bad, it is working also with negtaives - but the start values wont get updated - any idea how to include that? – constiii Apr 06 '23 at 16:49
  • Use the same logic: `df1['start'] = df1['end'] - df1['plus'] + df1['minus']` – Shubham Sharma Apr 06 '23 at 17:01
2

You can create helper Series for subtract plus and minus columns, create cumulative sums per groups by both columns and add first value of start for final end column, then for start column use DataFrameGroupBy.shift with replace first value by original values in Series.fillna:

plusminus = df1['plus'].sub(df1['minus'])
df1 = df1.assign(plusminus = plusminus)
g = df1.groupby(['key1','key2'])

df1['end'] = g['plusminus'].cumsum().add(g['start'].transform('first'))
df1['start'] = g['end'].shift().fillna(df1['start']).astype(int)
df1 = df1.drop('plusminus', axis=1)
print (df1)
  key1 key2  start  plus  minus  end
0    A    X      3     6      4    5
1    A    X      5     2      1    6
2    A    X      6     5      7    4
3    A    Y      8     3      1   10
4    A    Y     10     2      3    9
5    B    X      4     4      2    6
6    B    X      6     1      0    7
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
1

You can do the following with a combination of astype, df.iterrows() and a for loop:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.array([["A", "X", 3,6,4,0], ["A", "X", 0,2,1,0], ["A", "X", 0,5,7,0], ["A", "Y", 8,3,1,0], ["A", "Y", 0,2,3,0], ["B", "X", 4,4,2,0], ["B", "X", 0,1,0,0]]),
                   columns=['key1', 'key2', 'start', 'plus', 'minus', 'end'])

# Conver columns to integer
df[['start', 'plus', 'minus', 'end']] = df[['start', 'plus', 'minus', 'end']].astype(int)

# Start the row iterator
row_iterator = df.iterrows()
# take first item from row_iterator
_, last = next(row_iterator)
# Modify the first element
last['end'] = last['start'] + last['plus'] - last['minus']
df.loc[0, :] = last
# Iterate through the rest of the rows
for i, row in row_iterator:
    # Check the keys match
    if row['key1'] == last['key1'] and row['key2'] == last['key2']:
        # Add the end of last to the start of the next row
        row['start'] = last['end']
    # Caluculate new end for row
    row['end'] = row['start'] + row['plus'] - row['minus']
    # Ensure the changes are shown in the original dataframe
    df.loc[i, :] = row
    # Last row is now the current row
    last = row

After execution, df is now:

  key1 key2  start  plus  minus  end
0    A    X      3     6      4    5
1    A    X      5     2      1    6
2    A    X      6     5      7    4
3    A    Y      8     3      1   10
4    A    Y     10     2      3    9
5    B    X      4     4      2    6
6    B    X      6     1      0    7

Note: There is an error in your df2. The entry for start of row 5 should be 4 not 0, if we're following the logic you provided.

Marcelo Paco
  • 2,732
  • 4
  • 9
  • 26
  • The numbers can be of type float, hence your solution wont work - an idea how to adjust? – constiii Apr 06 '23 at 16:43
  • @constiii, instead of using `int` in the `astype()` call, you can use `float`. Like so: `df[['start', 'plus', 'minus', 'end']] = df[['start', 'plus', 'minus', 'end']].astype(float)` – Marcelo Paco Apr 06 '23 at 16:44
1

Since apply method works row-after-row, it can be used as follows without any for loops:

# create your df1
df1 = pd.DataFrame(np.array([["A", "X", 3,6,4,0], ["A", "X", 0,2,1,0], ["A", "X", 0,5,7,0], ["A", "Y", 8,3,1,0], ["A", "Y", 0,2,3,0], ["B", "X", 4,4,2,0], ["B", "X", 0,1,0,0]]),
                   columns=['key1', 'key2', 'start', 'plus', 'minus', 'end'])

df1[['start','plus','minus','end']] = df1[['start','plus','minus','end']].astype(np.int32)


# a dictionary to follow-up keys and start value
d = {'AX': None,
     'AY': None,
     'BX': None}

def helper(row):
    # modify d inside this function
    global d
    # get key by concatenating key1+key2
    key = row.key1+row.key2
    # if key is already seen, use the stored value as start value
    if d[key]:
        start = d[key]
    # if key is unseen, use the df1 start value
    else:
        start=row.start
    
    # calculate end value
    end = start + row.plus - row.minus
    
    # store the end value in dictionary
    # so that it can be used as start in next corresponding row
    d[key] = end
    # update
    return start,end

# update df1 start and end row-wise
df1[['start','end']] = df1.apply(helper,axis=1,result_type='expand')

In the end, updated df1 is equivalent to your df2.

rajkumar_data
  • 386
  • 2
  • 7