Better way to execute iterative function for very large dataset in python

Question

for i in range(1,len(df_raw)):
    if df_raw.loc[i-1, 'A']!= 0 & df_raw.loc[i, 'A']== 0 & df_raw.loc[i+1, 'A']== 0:
        df_raw.loc[i,'B'] = df_raw.loc[i+5,'B']

Hi all, I'm trying to run the code above on my data. As long as the data has around 100,000-150,000 rows, the code finishes, but for larger datasets it just keeps running with no output. Can you please help me with a better way of writing this code for bigger data sizes?

Please explain the logic you are trying to do here (your code) so it'll be easier for people to solve what you want in a more efficient way. Also, providing a sample dataframe (even with 5 rows) will help understand your columns and your logic. — OmerM25, Jun 28 '21 at 08:14

TLouf · Answer 1 · 2021-07-21T07:56:17.120

I think the method you're missing which efficiently performs this kind of logic is shift. Here's my proposal:

df_raw = df_raw.sort_index()  # Optional, if index is not sorted
df_raw['A_is_zero'] = df_raw['A'] == 0
# fill_value keeps the shifted columns boolean, so & and ~ behave as expected
df_raw['prev_A_is_zero'] = df_raw['A_is_zero'].shift(1, fill_value=True)
df_raw['next_A_is_zero'] = df_raw['A_is_zero'].shift(-1, fill_value=False)
B_to_change = df_raw['A_is_zero'] & df_raw['next_A_is_zero'] & ~df_raw['prev_A_is_zero']
df_raw.loc[B_to_change, 'B'] = df_raw['B'].shift(-5).loc[B_to_change]

Since you didn't provide a sample dataframe, I haven't tested this, so I can't guarantee it will work, but it should convey the main idea. Watch the edge cases: for instance, if B_to_change is True in any of the four rows before the last, you'll get NaNs in 'B', because shift(-5) has no value to pull from there. Also, you're using .loc with integers, and I don't know whether your index is a range, in which case my first line is useless, or whether it isn't and you meant to use iloc (see this link about the loc / iloc difference), in which case my first line should be removed because it would not lead to the expected result.
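As a quick sanity check of the idea, here is a minimal sketch that compares the shift-based version with a row-by-row loop on a small made-up frame. It assumes a default RangeIndex, a float 'B' column, and the `and` form of the condition; the sample values and the helper names `loop_version` and `shift_version` are illustrative only and do not come from the question.

```python
import pandas as pd

# Made-up sample data: 'A' has runs of zeros, 'B' is 0.0, 10.0, ..., 150.0
sample = pd.DataFrame({
    'A': [1, 0, 0, 0, 2, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0],
    'B': [float(10 * k) for k in range(16)],
})


def loop_version(df):
    """Row-by-row reference; stops early so that i + 5 stays in range."""
    out = df.copy()
    for i in range(1, len(out) - 5):
        if out.loc[i - 1, 'A'] != 0 and out.loc[i, 'A'] == 0 and out.loc[i + 1, 'A'] == 0:
            out.loc[i, 'B'] = out.loc[i + 5, 'B']
    return out


def shift_version(df):
    """Vectorized version built on shift, following the answer above."""
    out = df.copy()
    a_is_zero = out['A'] == 0
    prev_is_zero = a_is_zero.shift(1, fill_value=True)    # row 0 never matches
    next_is_zero = a_is_zero.shift(-1, fill_value=False)  # last row never matches
    mask = a_is_zero & next_is_zero & ~prev_is_zero
    out.loc[mask, 'B'] = out['B'].shift(-5)[mask]
    return out


looped = loop_version(sample)
shifted = shift_version(sample)
print(looped['B'].equals(shifted['B']))
```

The two versions agree here because every matching row has a row five positions below it; a match in the four rows before the last would produce NaN in the shift version.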

EDIT:

my requirements have some iterative, conditional, sequential operations, e.g.:
for i in range(1, len(df_raw)):
    if df_raw.loc[i, 'B'] != 0:
        df_raw.loc[i, 'A'] = df_raw.loc[i-1, 'A']

In this case (which you should have specified in your question), you can use forward filling as follows:

B_is_zero = df_raw['B'] == 0
# Keep 'A' only where 'B' is zero; the other rows become NaN
df_raw['new_A'] = df_raw['A'].where(B_is_zero)
# Fill each NaN with the last kept value above it
df_raw['A'] = df_raw['new_A'].ffill()

Once again, you should be careful of how you handle the edge case where 'B' is nonzero on the first row.
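To make the equivalence and the edge case concrete, here is a small sketch comparing the loop from the comment with a forward-fill version. The data is invented, 'A' is assumed to be a float column on a default RangeIndex, and `loop_version` and `ffill_version` are hypothetical helper names used only for illustration.

```python
import pandas as pd


def loop_version(df):
    """Sequential reference: copy 'A' from the previous row where 'B' is nonzero."""
    out = df.copy()
    for i in range(1, len(out)):
        if out.loc[i, 'B'] != 0:
            out.loc[i, 'A'] = out.loc[i - 1, 'A']
    return out


def ffill_version(df):
    """Vectorized version: blank out 'A' where 'B' is nonzero, then forward fill."""
    out = df.copy()
    b_is_zero = out['B'] == 0
    out['A'] = out['A'].where(b_is_zero).ffill()
    return out


# Regular case: 'B' is zero on the first row
sample = pd.DataFrame({
    'A': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    'B': [0, 7, 8, 0, 9, 0],
})
print(loop_version(sample)['A'].equals(ffill_version(sample)['A']))

# Edge case: 'B' is nonzero on the first rows, so there is nothing to fill from
edge = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'B': [5, 6, 0]})
print(loop_version(edge)['A'].tolist())
print(ffill_version(edge)['A'].tolist())
```

In the edge case the loop leaves the first row's 'A' untouched and propagates it, while the forward-fill version leaves NaN in the leading rows, so those rows need separate handling if the loop's behaviour is the one you want.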

Thanks for the reply, but my requirements have some iterative, conditional, sequential operations which are not possible using the "shift" method, e.g.: `for i in range(1, len(df_raw)):` `if df_raw.loc[i, 'B'] != 0:` `df_raw.loc[i, 'A'] = df_raw.loc[i-1, 'A']` — tausif shams, Jun 29 '21 at 10:23
@tausifshams note that this can still be vectorized. TLouf's updated `ffill` code is probably simplest, and it will be much faster than looping. — tdy, Jul 21 '21 at 08:11

score 0 · Accepted Answer · answered Jun 28 '21 at 08:24

It's possible that your code is just taking a long time to run because of the large number of steps it has to take (more than 150,000). There are a few things I would recommend doing:

See if you need to be running the code for every one of the elements in your array. If not, this will dramatically improve performance.
Check top/task manager/system monitor (depending on operating system) and see if you've run out of ram.
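Replace the bitwise and (&) with the more idiomatic, short-circuiting and. Note that & binds more tightly than != and ==, so the original condition is not evaluated as three separate comparisons.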
Profile your code
Add a progress bar:
At the command line: pip install tqdm
In your code

from tqdm import tqdm

for i in tqdm(range(1,len(df_raw))):
    if df_raw.loc[i-1, 'A'] != 0 and df_raw.loc[i, 'A'] == 0 and df_raw.loc[i+1, 'A']== 0:
        df_raw.loc[i,'B'] = df_raw.loc[i+5,'B']

Consider multiprocessing. If you can split the code up into discrete segments, you can parallelize it on a multi-core system. This can be difficult to do correctly, so I would start with the above steps. If you decide to go this route and need help, edit your question with a more complete code sample.
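On the & versus and point in the list above, the difference is about more than speed. Here is a minimal sketch with plain integers (the values are made up) showing how Python parses the two forms. Because & has higher precedence than the comparison operators, the & form becomes a chained comparison on `0 & cur_a` and `0 & next_a`, both of which are always 0 for integers.

```python
# A row pattern that should NOT match: the current and next values are nonzero
prev_a, cur_a, next_a = 5, 7, 9

# With `and`, each comparison is evaluated separately and short-circuits
with_and = prev_a != 0 and cur_a == 0 and next_a == 0

# With bare `&`, this parses as: prev_a != (0 & cur_a) == (0 & next_a) == 0
# For integers, 0 & x is always 0, so only `prev_a != 0` really matters
with_amp = prev_a != 0 & cur_a == 0 & next_a == 0

# Parenthesizing each comparison restores the intended meaning with `&`
with_parens = (prev_a != 0) & (cur_a == 0) & (next_a == 0)

print(with_and, with_amp, with_parens)
```

The parenthesized & form is the one to use for element-wise conditions on whole pandas columns, while and is appropriate for scalar values inside a loop.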

Thanks a lot for the reply. Just a small change from '&' --> 'and' has improved the speed a lot. Also, tqdm helps a lot in visualising the progress. — tausif shams, Jun 28 '21 at 12:40
