Python Pandas: Doing some updates to DataFrame values while iterating over rows

Question

I'd like to update some column values when iterating over rows, but it took sooo long time. I use itertuples() instead of iterrows() as suggested here and here, and I can't use apply function since I want to update two columns in one iteration.

I'll use a simplified example, since my case involves 10-ish more columns unrelated to the code below. And including them here will make the code looks worse.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0, 5, size=(90000, 4)),  
                  columns=['Initial', 'A', 'B'])

df['code'] = list('KLMNOP' * 15000)  # Adding column 'code'

df = df.sort_values('code')  # Sorting the df by 'code'

df['Total'] = np.nan

Then, I'd like to update column Initial and Total based on the values of A and B, and then also updating the Initial by previous row's Total. I mean, Total is carried over to next row's Initial when the current code is equal to previous row's code

def produce_total(init, a, b):
    if a >= 2 and b >= 2:
        return init + 1
    return init

last_code = ''
last_total = -100
for row in df.itertuples():
    # Print the current checkpoint
    if(row.Index % 1000 == 0):
        print row.Index  

    # Carry over the prev Total to current Initial
    if last_code == row.code:
        df.loc[row.Index, 'Initial'] = last_total  

    # Prepare the updated Initial value
    # Because what is inside 'row' seems unaffected by the update
    new_initial = df.loc[row.Index, 'Initial']

    # Find Total and assign to the df
    new_total = produce_total(
        new_initial,
        row.A,
        row.B
    )
    df.loc[row.Index, 'Total'] = new_total

    last_code = row.code
    last_total = new_total

The code ran for almost an hour, but only reach Index 30000-ish. Any idea or suggestion for another or two efficient ways to do this?

Or, there are other aspects I need to consider (dropping some columns, etc)?

Thanks a lot!

score 0 · Answer 1 · answered Feb 09 '17 at 15:00

IIUC , you need to use pandas cumsum

This is what I assumed, for each code initial Total is always -100 , & when ever the value of A & B are greater than 2 , needs to add one to previous rows total.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0, 5, size=(90000, 3)),  
                  columns=['Initial', 'A', 'B'])
df['code'] = list('KLMNOP' * 15000)  # Adding column 'code'

df = df.sort_values('code')  # Sorting the df by 'code'

df['new_Initial'] = np.where((df.A>2 ) & (df.B>2 ) ,1,0) 

df.set_value(0, 'new_Initial', -100)
df.set_value(1, 'new_Initial', -100)
df.set_value(2, 'new_Initial', -100)
df.set_value(3, 'new_Initial', -100)
df.set_value(4, 'new_Initial', -100)

df['Total'] = df.groupby(['code']).new_Initial.cumsum()
print df

Output

       Initial  A  B code  new_Initial  Total
0            1  0  2    K         -100   -100
84312        4  1  2    K            0   -100
34110        1  4  0    K            0   -100
34104        2  0  4    K            0   -100
34098        0  4  3    K            1    -99
34092        4  1  0    K            0    -99
34086        2  2  4    K            0    -99
34080        1  2  2    K            0    -99
84318        4  2  2    K            0    -99
34074        2  3  2    K            0    -99
34116        2  1  1    K            0    -99
34068        4  3  0    K            0    -99
34056        4  3  4    K            1    -98
34050        2  4  1    K            0    -98
34044        1  1  0    K            0    -98
84324        1  0  2    K            0    -98
34038        0  1  0    K            0    -98
34032        1  2  0    K            0    -98
34026        0  1  1    K            0    -98
34020        0  4  4    K            1    -97
34014        0  0  4    K            0    -97
34062        4  0  3    K            0    -97
34122        2  3  3    K            1    -96
34128        1  1  1    K            0    -96
34134        3  2  3    K            0    -96
34242        0  1  3    K            0    -96
34236        4  3  2    K            0    -96
34230        4  3  1    K            0    -96
34224        4  2  0    K            0    -96
84294        2  3  2    K            0    -96
       ... .. ..  ...          ...    ...
51245        4  4  0    P            0   2355
51239        3  3  1    P            0   2355
51365        0  1  2    P            0   2355
51371        1  3  4    P            1   2356
51377        4  2  3    P            0   2356
51383        0  2  2    P            0   2356
51515        0  2  1    P            0   2356
51509        4  2  2    P            0   2356
51503        3  0  0    P            0   2356
51497        1  3  0    P            0   2356
51491        4  3  2    P            0   2356
51485        3  3  2    P            0   2356
51479        4  0  3    P            0   2356
51473        2  3  3    P            1   2357
51467        3  4  3    P            1   2358
51461        4  0  2    P            0   2358
51827        4  0  2    P            0   2358
51455        1  2  1    P            0   2358
51443        3  0  4    P            0   2358
51437        0  0  4    P            0   2358
51431        2  2  2    P            0   2358
51425        3  2  1    P            0   2358
51419        2  3  2    P            0   2358
51413        2  0  2    P            0   2358
51407        0  1  3    P            0   2358
51401        4  2  2    P            0   2358
51395        2  4  4    P            1   2359
51389        1  3  3    P            1   2360
51449        3  4  0    P            0   2360
89999        0  1  4    P            0   2360

Python Pandas: Doing some updates to DataFrame values while iterating over rows

1 Answers1