I need to go through the TimeStamp column of a dataframe row by row. The dataframe has approximately 40,000,000 rows. I'm doing this with a for loop and it works, but it takes a long time. Is there a faster way?

index   TimeStamp             FAILURE MESSAGE
0       2018-01-01 00:00:00   'DOOR OPEN'
1       2018-01-01 00:00:01   'DOOR OPEN'
2       2018-01-01 00:00:02   'DOOR OPEN'

Code:

cont = 0
for i in range(0, len(df)):
    if(df['TimeStamp'].iloc[i] >= '2018-01-01 00:00:01'):
        cont +=1
martineau
Jane Borges
  • `len(df[df['TimeStamp'] >= pd.Timestamp('2018-01-01 00:00:01')].index)`. You should almost never iterate over a dataframe in pandas. – Brian Sep 23 '19 at 19:17

1 Answer

I would do

(df['TimeStamp'] >= pd.Timestamp('2018-01-01 00:00:01')).sum()

Pandas is optimized for vectorized operations, so you generally don't need to loop over a dataframe.
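For reference, here is a minimal sketch of the vectorized count next to the original loop, on a three-row stand-in for the dataframe in the question. The sample data and the variable names `cutoff`, `mask`, `count` and `loop_count` are assumptions for illustration, and the `pd.to_datetime` call assumes the column might be stored as strings.

```python
import pandas as pd

# Three-row stand-in for the 40,000,000-row dataframe in the question.
df = pd.DataFrame({
    "TimeStamp": [
        "2018-01-01 00:00:00",
        "2018-01-01 00:00:01",
        "2018-01-01 00:00:02",
    ],
    "FAILURE MESSAGE": ["DOOR OPEN"] * 3,
})

# Convert to datetime64 so comparisons are done on timestamps, not strings.
df["TimeStamp"] = pd.to_datetime(df["TimeStamp"])

cutoff = pd.Timestamp("2018-01-01 00:00:01")

# The comparison returns a boolean Series; True counts as 1 in .sum().
mask = df["TimeStamp"] >= cutoff
count = int(mask.sum())

# The row-by-row loop from the question, kept only as a cross-check.
loop_count = 0
for i in range(len(df)):
    if df["TimeStamp"].iloc[i] >= cutoff:
        loop_count += 1

print(count, loop_count)
```

Both approaches give the same number; the vectorized version does the comparison over the whole column at once instead of once per row.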

sedavidw
  • Don't use the built-in `sum` function on pandas / numpy data structures. That defeats the whole point, because it will essentially be a Python-level for-loop. – juanpa.arrivillaga Sep 23 '19 at 21:01
  • Answer edited to use the series sum, but the statement that the sum function is a for loop is incorrect; see https://stackoverflow.com/questions/24578896/python-built-in-sum-function-vs-for-loop-performance – sedavidw Sep 24 '19 at 13:01
  • I said 'essentially' because it still has to use a Python iterator, whereas `numpy`/`pandas` does not. – juanpa.arrivillaga Sep 24 '19 at 18:18
  • I haven't tested this personally, but my understanding is that it's using optimized C code and not just a Python iterator. That Stack Overflow post suggests there's a pretty non-trivial difference in performance between the two, so I don't think it's fair to make that claim. – sedavidw Sep 24 '19 at 18:40
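One way to settle the question in the comments on your own machine is to time both forms directly. The sketch below is only an illustration: the series size `n`, the synthetic boolean `mask`, and the repeat count are arbitrary assumptions, and no particular timings are claimed.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic boolean Series standing in for the comparison result.
n = 100_000
mask = pd.Series(np.arange(n) % 2 == 0)

# Built-in sum iterates over the Series element by element.
builtin_total = sum(mask)

# Series.sum performs the reduction inside pandas/NumPy.
method_total = mask.sum()

# Time each form a few times; results depend on hardware and versions.
t_builtin = timeit.timeit(lambda: sum(mask), number=3)
t_method = timeit.timeit(lambda: mask.sum(), number=3)

print(f"built-in sum: {t_builtin:.4f} s, Series.sum: {t_method:.4f} s")
```

Both calls return the same total, so the choice between them only affects speed, not the result.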