Is there a way to make a loop over huge data faster?

Question

I have a pandas DataFrame with 10 million rows. The code below runs a for loop over it in Google Colab, but it is very slow. Is there a way to make a loop with these multiple conditions faster (for example with np.where), or is there some other solution? I need help rewriting this code in another way to solve this problem.

The code is:

```python
for i in range(0, len(data)):
    last = data.head(i)
    select_acc = last.loc[last['ACOUNTNO'] == data['ACOUNTNO'][i]]
    avr = select_acc[(select_acc['average'] > 0)]
    if len(avr) == 0:
        lastavrage = 0
    else:
        lastavrage = avr.average.mean()

    if (data["average"][i] < lastavrage) and (data['LASTREAD'][i] > 0):
        data["label"][i] = "abnormal"
        data["problem"][i] = "error"
```

But my project is in Python, so I need another way to speed up the loop. — ii mm, Nov 08 '22 at 19:33
What is `data`? a Pandas DataFrame? Please [edit] to add all the necessary details and fix the indenting as well as the formatting. For more info, see [mre] and [code formatting help](/editing-help#code) as well as [How to make good reproducible pandas examples](/q/20109391/4518341). You might also want to read [How to ask a good question](/help/how-to-ask). — wjandrea, Nov 08 '22 at 19:34
Quite possibly, it can be faster. But that would likely include changing the structure of `data` into something more suitable for fast performance. Learning about numpy would be a good idea, yes. — zvone, Nov 08 '22 at 19:37
Take a look at [`pyspark`](https://pypi.org/project/pyspark/). There is also this [article](https://medium.com/geekculture/simple-tricks-to-speed-up-pandas-by-100x-3b7e705783a8) on Medium. — accdias, Nov 08 '22 at 19:39
It makes some comparisons and checks some conditions to set the label column to "normal" or "abnormal". — ii mm, Nov 08 '22 at 19:45

chrslg · Accepted Answer · 2022-11-08T21:33:23.833

Generally speaking, the worst thing to do is to iterate over rows.

I can't see a totally iteration-free solution. (By "iteration-free" I mean "without explicit iteration in Python". Of course, any solution iterates somewhere, but some do so under the hood, in the internal code of pandas or numpy, which is much faster.)
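To make that distinction concrete, here is a small sketch comparing an explicit Python loop with the equivalent vectorized expression. The toy column values and the 2.5 threshold are made up for illustration; only the `average` column name comes from the question.

```python
import numpy as np
import pandas as pd

# Toy data, assumed for illustration only
data = pd.DataFrame({"average": np.arange(1, 6, dtype=float)})

# Explicit Python-level iteration: the interpreter handles every row itself
looped = [1 if data["average"].iloc[i] > 2.5 else 0 for i in range(len(data))]

# Same computation, but the loop over rows runs inside NumPy
vectorized = np.where(data["average"] > 2.5, 1, 0)
```

Both produce the same flags; on millions of rows the second form is typically far faster because no per-row Python code runs.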

But you could at least try to iterate over account numbers rather than rows (there are certainly fewer account numbers than rows; otherwise you wouldn't need this computation anyway).

For example, you could compute the threshold for an "abnormal" average like this:

```python
for no in data.ACCOUNTNO.unique():
    f = data.ACCOUNTNO == no          # True/False Series of rows matching this account
    cs = data[f].average.cumsum()     # Cumulative sum of the 'average' column for this account
    num = f.cumsum()                  # Running count of rows for this account
    data.loc[f, 'lastavr'] = cs / num # Running mean, including the current row
```

After that, column 'lastavr' contains what your variable `lastavrage` would be worth in your code. Well, not exactly: your variable doesn't count the current row, while mine does. We could have computed (cs-data.average)/(num-1) instead of cs/num to have it your way. But what for? The only thing you do with it is compare it to the current data.average, and data.average>(cs-data.average)/(num-1) iff data.average>cs/num. So it is simpler this way, and it avoids a special case for the first row.
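A quick way to convince yourself of that equivalence: for num > 1, a < (cs - a)/(num - 1) rearranges to a*num < cs, that is a < cs/num, and the same holds with > in place of <. The sketch below checks this numerically for a single account. The integer values are made up, and only the names `cs` and `num` mirror the code above.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.integers(1, 100, size=1000)  # made-up 'average' values for one account

cs = np.cumsum(values)                # running sum, current row included
num = np.arange(1, len(values) + 1)   # running row count, current row included

# Skip the first row: the mean of "earlier rows" is undefined there (num - 1 == 0)
cur, cs_t, num_t = values[1:], cs[1:], num[1:]
below_including = cur < cs_t / num_t                 # compare with mean including current row
below_excluding = cur < (cs_t - cur) / (num_t - 1)   # compare with mean of earlier rows only

same = bool(np.array_equal(below_including, below_excluding))
```

If the algebra holds, `same` should be `True`: both comparisons flag exactly the same rows.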

Then, once you have that new column (you could also just use a Series without adding it as a column, a bit like I did for cs and num, which are not columns of data), it is simply a matter of:

```python
pb = (data.average < data.lastavr) & (data.LASTREAD > 0)
data.loc[pb, 'label'] = 'abnormal'
data.loc[pb, 'problem'] = 'error'
```

Note that the fact that I don't have a way to avoid the iteration over ACCOUNTNO doesn't mean that there isn't one. In fact, I am pretty sure that with lookup or some combination of join/merge/groupby there could be one. But it probably doesn't matter much, because you probably have far fewer ACCOUNTNO values than rows. So my remaining loop is probably negligible.
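One possible groupby-based variant, offered as a sketch rather than a drop-in replacement. It assumes the rows are already in chronological order, uses the ACCOUNTNO spelling from this answer (the question's code spells it ACOUNTNO), and runs on made-up toy values. Unlike the per-account loop above, it keeps two details of the question's code: only earlier rows with average > 0 are averaged, and the result is 0 when there are none. It also avoids building a boolean mask per account, which scans the whole column once for every account.

```python
import pandas as pd

# Toy data; column names follow the answer above, values are made up
data = pd.DataFrame({
    "ACCOUNTNO": [1, 1, 1, 2, 2, 1],
    "average":   [10.0, 0.0, 4.0, 5.0, 9.0, 8.0],
    "LASTREAD":  [3, 3, 3, 0, 2, 1],
})

is_pos = (data["average"] > 0).astype(int)        # 1 where the row counts toward the mean
positive = data["average"].where(is_pos == 1, 0.0)  # non-positive averages contribute 0
g = data["ACCOUNTNO"]

# Per-account running sum/count, minus the current row, so earlier rows only
prev_sum = positive.groupby(g).cumsum() - positive
prev_cnt = is_pos.groupby(g).cumsum() - is_pos

# Mean of earlier positive averages; 0 when there are none, as in the question
lastavr = (prev_sum / prev_cnt.where(prev_cnt > 0)).fillna(0.0)

pb = (data["average"] < lastavr) & (data["LASTREAD"] > 0)
data.loc[pb, "label"] = "abnormal"
data.loc[pb, "problem"] = "error"
```

On the toy frame, `lastavr` for the last row is 7.0 (the mean of 10 and 4 for account 1), and only rows 1 and 2 are flagged as abnormal.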

Thanks for the explanation, I will try it; it seems to be useful. — ii mm, Nov 13 '22 at 19:02
