How to run dataframe faster than "for"?

Question

I need to go through the TimeStamp column of a dataframe row by row. The dataframe has approximately 40,000,000 rows. I'm currently doing this with a for loop, and it works, but it takes a long time. Is there a faster way?

index   TimeStamp             FAILURE MESSAGE
0       2018-01-01 00:00:00   'DOOR OPEN'
1       2018-01-01 00:00:01   'DOOR OPEN'
2       2018-01-01 00:00:02   'DOOR OPEN'

Code:

cont = 0
for i in range(len(df)):
    if df['TimeStamp'].iloc[i] >= '2018-01-01 00:00:01':
        cont += 1

`len(df[df['TimeStamp'] >= pd.Timestamp('2018-01-01 00:00:01')].index)`. You should almost never iterate over a dataframe in pandas. — Brian, Sep 23 '19 at 19:17

sedavidw · Answer 1 · 2019-09-24T13:00:50.170

1

I would do

(df['TimeStamp'] >= pd.Timestamp('2018-01-01 00:00:01')).sum()

Pandas is optimized so that you generally don't need to loop over it.
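
To make the one-liner above concrete, here is a minimal sketch on a three-row stand-in for the question's dataframe. It assumes the TimeStamp column may be stored as strings and converts it with `pd.to_datetime` first; the names `threshold` and `mask` are only illustrative, and `cont` is reused from the question's loop.

```python
import pandas as pd

# Tiny stand-in for the 40,000,000-row dataframe from the question.
df = pd.DataFrame({
    "TimeStamp": [
        "2018-01-01 00:00:00",
        "2018-01-01 00:00:01",
        "2018-01-01 00:00:02",
    ],
    "FAILURE MESSAGE": ["DOOR OPEN", "DOOR OPEN", "DOOR OPEN"],
})

# Assumption: the column might be plain strings, so convert it once
# to datetime64 before comparing against a pd.Timestamp.
df["TimeStamp"] = pd.to_datetime(df["TimeStamp"])

threshold = pd.Timestamp("2018-01-01 00:00:01")

# The comparison runs over the whole column at once and yields a boolean Series.
mask = df["TimeStamp"] >= threshold

# True counts as 1, so summing the mask counts the matching rows.
cont = int(mask.sum())
print(cont)  # 2
```

The same `mask` can also be used to select the matching rows with `df[mask]` if you need more than the count.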

edited Sep 24 '19 at 13:00

answered Sep 23 '19 at 19:20


Don't use the built-in `sum` function on pandas / numpy data structures. That defeats the whole point, because it is essentially a Python-level for-loop. – juanpa.arrivillaga Sep 23 '19 at 21:01
Answer edited to use the Series `sum`, but the statement that the built-in `sum` function is a for-loop is incorrect; see https://stackoverflow.com/questions/24578896/python-built-in-sum-function-vs-for-loop-performance – sedavidw Sep 24 '19 at 13:01
I said 'essentially' because it still has to use a Python iterator, whereas `numpy`/`pandas` does not. – juanpa.arrivillaga Sep 24 '19 at 18:18
I haven't tested this personally, but my understanding is that it's using optimized C code and not just a Python iterator. That Stack Overflow post suggests there's a non-trivial difference in performance between the two, so I don't think it's fair to make that claim. – sedavidw Sep 24 '19 at 18:40
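
For anyone who wants to check the point debated in these comments, a rough timing sketch follows. It assumes a randomly generated boolean Series of 200,000 values as a stand-in for the real comparison result; the variable names are illustrative, and the timings will vary by machine and library version, so no numbers are claimed here.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical boolean Series standing in for (df['TimeStamp'] >= ...).
rng = np.random.default_rng(0)
mask = pd.Series(rng.random(200_000) >= 0.5)

# Built-in sum walks the Series through Python's iterator protocol,
# one element at a time.
builtin_total = sum(mask)

# Series.sum performs the reduction inside pandas/NumPy.
method_total = mask.sum()

# Both approaches must agree on the count.
assert builtin_total == method_total

t_builtin = timeit.timeit(lambda: sum(mask), number=3)
t_method = timeit.timeit(lambda: mask.sum(), number=3)

print(f"built-in sum: {t_builtin:.4f} s for 3 runs")
print(f"Series.sum:   {t_method:.4f} s for 3 runs")
```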
