
I have a very simple loop that just takes too long to iterate over my big dataframe.

value_needed = df.at[n, 'column_A']

for y in range(len(df)):
    # re-filters the whole DataFrame on every iteration
    index = df[df['column_B'].ge(value_needed)].index[y]
    if index > n:
        break

With this, I'm trying to find the first index after `n` whose value in `column_B` is greater than `value_needed`. The problem is that this loop is just too inefficient to run when `len(df) > 200000`.

Any ideas on how to solve this issue?

2 Answers


In general you should try to avoid loops with pandas; here is a vectorized way to get what you want:

df.loc[(df['column_B'].ge(value_needed)) & (df.index > n)].index[0]
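For example, on a small made-up DataFrame (the data, `value_needed`, and `n` below are purely illustrative), the one-liner picks out the first index label after `n` that satisfies the condition:

```python
import pandas as pd

# hypothetical example data
df = pd.DataFrame({'column_B': [1, 5, 2, 7, 3, 9]})
value_needed = 4
n = 1

# first index label after n whose column_B value is >= value_needed
result = df.loc[(df['column_B'].ge(value_needed)) & (df.index > n)].index[0]
print(result)  # 3  (the row holding 7, since index 1 is excluded by df.index > n)
```

Because the boolean mask is computed once over the whole column, this stays fast even for hundreds of thousands of rows.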
fmarm
  • unfortunately, this code still doesn't do what I need. I can't just take the first value that suits my criteria (`index[0]`). I need to increase the value inside the brackets to find the next value after `n`. That's why I was using a loop – Nycolas Mancini Apr 14 '20 at 23:40
  • I have added a `df.index > n` condition inside `loc`, this should work now – fmarm Apr 14 '20 at 23:48
  • That was it, chief! Thanks a lot for the simple solution. – Nycolas Mancini Apr 15 '20 at 12:38

I wish you had sample data. Try this on your data and let me know what you get:

import numpy as np
index = np.where(df['column_B'] > value_needed)[0].flat[0]

Then

#continue with other logic
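One caveat worth noting (my addition, with made-up data): `np.where` returns *positional* indices, not index labels, so if the DataFrame's index is not the default `RangeIndex` you may need to translate the position back into a label with `df.index[...]`:

```python
import numpy as np
import pandas as pd

# hypothetical example data with a non-default index
df = pd.DataFrame({'column_B': [1, 5, 2, 7]}, index=[10, 11, 12, 13])
value_needed = 4

pos = np.where(df['column_B'] > value_needed)[0].flat[0]  # first matching position
label = df.index[pos]                                     # corresponding index label
print(pos, label)  # 1 11
```

This also means the result is the first match in the whole column; it does not by itself restrict the search to positions after `n` as the question requires.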
pi_pascal