I'd like to update some column values when iterating over rows, but it took sooo long time. I use itertuples()
instead of iterrows()
as suggested here and here, and I can't use apply
function since I want to update two columns in one iteration.
I'll use a simplified example, since my case involves 10-ish more columns unrelated to the code below. And including them here will make the code looks worse.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0, 5, size=(90000, 4)),
columns=['Initial', 'A', 'B'])
df['code'] = list('KLMNOP' * 15000) # Adding column 'code'
df = df.sort_values('code') # Sorting the df by 'code'
df['Total'] = np.nan
Then, I'd like to update column Initial
and Total
based on the values of A
and B
, and then also updating the Initial
by previous row's Total
.
I mean, Total
is carried over to next row's Initial
when the current code
is equal to previous row's code
def produce_total(init, a, b):
if a >= 2 and b >= 2:
return init + 1
return init
last_code = ''
last_total = -100
for row in df.itertuples():
# Print the current checkpoint
if(row.Index % 1000 == 0):
print row.Index
# Carry over the prev Total to current Initial
if last_code == row.code:
df.loc[row.Index, 'Initial'] = last_total
# Prepare the updated Initial value
# Because what is inside 'row' seems unaffected by the update
new_initial = df.loc[row.Index, 'Initial']
# Find Total and assign to the df
new_total = produce_total(
new_initial,
row.A,
row.B
)
df.loc[row.Index, 'Total'] = new_total
last_code = row.code
last_total = new_total
The code ran for almost an hour, but only reach Index 30000-ish. Any idea or suggestion for another or two efficient ways to do this?
Or, there are other aspects I need to consider (dropping some columns, etc)?
Thanks a lot!