Suppose I have a large dataframe, with columns such as this:
Date | Person 1 | Person 2 | Value 1 | Value 2
+----------------------------------------------+
Suppose the dataframe is sorted from the oldest to newest.
Now, I want to iterate over this dataframe.
For each row, I first look at Person 1. Call this cell Person_1_id
For Person 1, I want to take the most recent previous row, get Value 1, and perform a complex calculation.
The way I am currently getting the most recent Value 1 (v1) is:
value1s = df.loc[(df.ID1 == Person_1_id) & (df.Date < date)]
v1 = value1s.iloc[-1]
From what I understand, loc will loop and get all the previous values meeting the criteria.
Would it not be faster to simply loop backward up the dataframe, and pick the first row that meets the condition?
If so, how would one iterate backwards up the dataframe?
EDIT: An example:
My intial table:
DATE Person 1 Person 2 value 1 value 2
13/08/2019 71 19 1000 1000
16/08/2019 19 68 1000 1000
19/08/2019 30 98 1000 1000
22/08/2019 42 32 1000 1000
25/08/2019 19 78 1000 1000
Algorithm:
- Iterate over each row. Call the current row 'current_row'. All calculations will update this 'current_row' value 1
- Get the Id of Person 1 (the number in the column 'Person 1'). Lets use person 19 as an example
- Find the most recent previous row, 'most_recent_prev_row', in which person 19 appears, either in the 'Person 1' OR ' Person 2' column
Perform the following calculation:
flag = 0
if person in 'Person 1' then flag = 1
new_value = most_recent_prev_row['value 1'] + flag * 0.5 * (most_recent_prev_row['value 2']
current_row['Value 1'] = new_value
For example, updating the second row in the table above for person 19:
DATE Person 1 Person 2 value 1 value 2
13/08/2019 19 71 1000 1000
16/08/2019 19 68 1000+0.5*1000=1500 1000
If INSTEAD the first row had been:
DATE Person 1 Person 2 value 1 value 2
13/08/2019 71 19 1000 1000
16/08/2019 19 68 1000-0.5*1000=1500 1000
Finally, my calculation code is as follows. It is applied row by row and is incredibly slow:
# helper function to calculate new value
def calculate(value1, value2, flag):
new_value = value1 + flag * 0.5 * value2
# function to update value
def updateValue(playerId, date):
# default value if player has no wins or losses
score = 1000
# get win and losses for the player. Players in 'Person 1' won, players in 'Person 2' lost.
wins = df.loc[(df['Person 1'] == playerId) & (df.DATE < date)]
losses = df.loc[(df['Person 2'] == playerId) & (df.DATE < date)]
# player only has wins
if not wins.empty and losses.empty:
result_row = wins.iloc[-1]
score = calculate(result_row.value1, result_row.value2, 1)
# player only has losses
if wins.empty and not losses.empty:
result_row = losses.iloc[-1]
score = calculate(result_row.value1, result_row.value2, 0)
# player has wins and losses
if not wins.empty and not losses.empty:
p1_win_row = wins.iloc[-1]
p1_lost_row = losses.iloc[-1]
result_row = pd.DataFrame()
if p1_win_row.DATE < p1_lost_row.DATE:
result_row = losses.iloc[-1]
score = calculate(result_row.value1, result_row.value2, 0)
else:
result_row = wins.iloc[-1]
score = calculate(result_row.value1, result_row.value2, 1)
return score