I have a pandas dataframe with about 800,000 rows. Some of the columns contains information that is updated only every hour, which corresponds to about 200 rows. Since it is the result of a manual industrial analysis, it may be missing once in a while, and the previous result is merely repeated.
However, if it is repeated for too long (say, 2000 rows), the information is not longer relevant for predicting the final output. So I would like to drop data only if the sequential repetition goes over the 2000-row threshold.
For the sake of visualizing the DataFrame, let's say I want to avoid sequential repetitions over 5 occurrences. So starting from the following dataframe:
Analy. Result A
0 105 # Sequential count: 1
1 105 # Sequential count: 2
2 105 # Sequential count: 3 >> ok, don't drop any
3 97
4 105
5 97 # Sequential count: 1
6 97 # Sequential count: 2
7 97 # Sequential count: 3
8 97 # Sequential count: 4
9 97 # Sequential count: 5
10 97 # Sequential count: 6 >> drop from here onwards
11 97
12 97
13 97
14 80
I would like to keep only the first 5 occurrences of a sequential repetition. In this case, dropping rows 10 to 13.
Analy. Result A
0 105
1 105
2 105
3 97
4 105
5 97
6 97
7 97
8 97
9 97
14 80
I took a look at some questions around here, but they don't quite fit the problem, like this one:
I'm thinking about creating several shifted columns and comparing them, but while that seems reasonable for a repetition of N=5, that also just seems a very slow and inefficient way for N in the thousands range.
Your time and help is appreciated!