0

Suppose I have a large dataframe, with columns such as this:

Date | Person 1 | Person 2 | Value 1 |  Value 2
+----------------------------------------------+

Suppose the dataframe is sorted from the oldest to newest.

Now, I want to iterate over this dataframe.

  • For each row, I first look at Person 1. Call this cell Person_1_id

  • For Person 1, I want to take the most recent previous row, get Value 1, and perform a complex calculation.

The way I am currently getting the most recent Value 1 (v1) is:

value1s = df.loc[(df.ID1 == Person_1_id) & (df.Date < date)]
v1 = value1s.iloc[-1]

From what I understand, loc will loop and get all the previous values meeting the criteria.

Would it not be faster to simply loop backward up the dataframe, and pick the first row that meets the condition?

If so, how would one iterate backwards up the dataframe?

EDIT: An example:

My intial table:

DATE        Person 1    Person 2    value 1 value 2
13/08/2019  71          19          1000    1000
16/08/2019  19          68          1000    1000
19/08/2019  30          98          1000    1000
22/08/2019  42          32          1000    1000
25/08/2019  19          78          1000    1000

Algorithm:

  • Iterate over each row. Call the current row 'current_row'. All calculations will update this 'current_row' value 1
  • Get the Id of Person 1 (the number in the column 'Person 1'). Lets use person 19 as an example
  • Find the most recent previous row, 'most_recent_prev_row', in which person 19 appears, either in the 'Person 1' OR ' Person 2' column

Perform the following calculation:

flag = 0
if person in 'Person 1' then flag = 1
new_value = most_recent_prev_row['value 1'] + flag * 0.5 * (most_recent_prev_row['value 2']
current_row['Value 1'] = new_value

For example, updating the second row in the table above for person 19:

DATE        Person 1    Person 2    value 1               value 2
13/08/2019  19          71          1000                  1000
16/08/2019  19          68          1000+0.5*1000=1500    1000

If INSTEAD the first row had been:

DATE        Person 1    Person 2    value 1               value 2
13/08/2019  71          19          1000                  1000
16/08/2019  19          68          1000-0.5*1000=1500    1000

Finally, my calculation code is as follows. It is applied row by row and is incredibly slow:

# helper function to calculate new value
def calculate(value1, value2, flag):
   new_value = value1 + flag * 0.5 * value2

# function to update value
def updateValue(playerId, date):        
    # default value if player has no wins or losses
    score = 1000

    # get win and losses for the player. Players in 'Person 1' won, players in 'Person 2' lost.
    wins = df.loc[(df['Person 1'] == playerId) & (df.DATE < date)]
    losses = df.loc[(df['Person 2'] == playerId) & (df.DATE < date)]

    # player only has wins
    if not wins.empty and losses.empty:
        result_row = wins.iloc[-1]
        score = calculate(result_row.value1, result_row.value2, 1)

    # player only has losses
    if wins.empty and not losses.empty:
        result_row = losses.iloc[-1]
        score = calculate(result_row.value1, result_row.value2, 0)

    # player has wins and losses
    if not wins.empty and not losses.empty:        
        p1_win_row = wins.iloc[-1]
        p1_lost_row = losses.iloc[-1]

        result_row = pd.DataFrame()

        if p1_win_row.DATE < p1_lost_row.DATE:
            result_row = losses.iloc[-1]
            score = calculate(result_row.value1, result_row.value2, 0)
        else:
            result_row = wins.iloc[-1]
            score = calculate(result_row.value1, result_row.value2, 1)

    return score
eyllanesc
  • 235,170
  • 19
  • 170
  • 241
Convex Leopard
  • 121
  • 1
  • 12
  • I think .where is faster https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.where.html or run the benchmark yourself using https://stackoverflow.com/questions/2866380/how-can-i-time-a-code-segment-for-testing-performance-with-pythons-timeit – aunsid Sep 13 '19 at 21:24
  • can you add some sample data and the expected output? – MattR Sep 13 '19 at 21:39
  • Hi i added some more details and examples. I am basically trying to run a recursive calculation (which is why I call it complex), but I cannot figure out how to do it using optimized pandas functions. The only way I managed to apply this slow function is ittertuples – Convex Leopard Sep 13 '19 at 22:21

0 Answers0