
I have a pandas DataFrame with about 800,000 rows. Some of the columns contain information that is updated only every hour, which corresponds to about 200 rows. Since the value comes from a manual industrial analysis, it is sometimes missing, in which case the previous result is simply repeated.

However, if it is repeated for too long (say, 2000 rows), the information is no longer relevant for predicting the final output. So I would like to drop data only when the sequential repetition goes over the 2000-row threshold.

For the sake of visualization, let's say I want to avoid sequential repetitions of more than 5 occurrences. So, starting from the following DataFrame:

     Analy. Result A
0               105   # Sequential count: 1
1               105   # Sequential count: 2
2               105   # Sequential count: 3 >> ok, don't drop any
3               97
4               105
5               97   # Sequential count: 1
6               97   # Sequential count: 2
7               97   # Sequential count: 3
8               97   # Sequential count: 4
9               97   # Sequential count: 5 
10              97   # Sequential count: 6 >> drop from here onwards
11              97 
12              97
13              97    
14              80    

I would like to keep only the first 5 occurrences of a sequential repetition. In this case, dropping rows 10 to 13.

     Analy. Result A
0               105
1               105
2               105
3               97
4               105
5               97
6               97 
7               97 
8               97
9               97   
14              80    

I took a look at some similar questions around here, but they don't quite fit the problem.

I'm thinking about creating several shifted columns and comparing them. While that seems workable for a repetition threshold of N = 5, it also seems very slow and inefficient for N in the thousands range.
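For reference, here is a rough, illustrative sketch of the shift-based idea for N = 5 (using the small example DataFrame from above); each extra shift adds another full-column comparison, which is why I don't think it would scale to N = 2000:

import pandas as pd

df = pd.DataFrame([105,105,105,97,105,97,97,97,97,97,97,97,97,97,80], columns=['Analy. Result A'])
col = df['Analy. Result A']
N = 5

# A row is "over the limit" if it equals all N previous rows
over_limit = pd.Series(True, index=df.index)
for k in range(1, N + 1):
    over_limit &= col.eq(col.shift(k))

df = df[~over_limit]   # drops rows 10 to 13 in this example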

Your time and help are appreciated!


1 Answer


I suggest you take the difference between successive rows. If you create a column that checks whether this difference is zero (True) or non-zero (False), you can then take a cumulative sum to get a unique group ID for each run of repeated values.

Then you can enumerate the readings within each group and keep only those below the threshold.

import pandas as pd

df = pd.DataFrame([105,105,105,97,105,97,97,97,97,97,97,97,97,97,80], columns=['Analy. Result A'])

# Get a unique group for each run of consecutive identical readings
df['dup_group'] = (df.diff().fillna(1) != 0).cumsum()

# Enumerate each reading within its group (starting from 0)
df['dup_count'] = df.groupby('dup_group').apply(lambda x: pd.Series(range(x.shape[0]))).values

# Threshold so that only the first 5 readings within each group are kept
max_dup = 5
df = df[df['dup_count'] < max_dup]
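As a side note, the same enumeration can be written a bit more compactly with groupby(...).cumcount() on the column itself; a minimal sketch along the same lines (not required for the approach above):

import pandas as pd

df = pd.DataFrame([105,105,105,97,105,97,97,97,97,97,97,97,97,97,80], columns=['Analy. Result A'])

col = df['Analy. Result A']
groups = col.ne(col.shift()).cumsum()      # new group ID at every value change
dup_count = df.groupby(groups).cumcount()  # 0-based position within each run

max_dup = 5
df = df[dup_count < max_dup]               # keeps rows 0-9 and 14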
  • Thanks a lot! At some point I thought about using ```groupby``` but I discarded the possibility immediately because I didn't think about getting *unique groups for every sequence*. Simple and clever solution :) Exactly what I need – leo.barddal Jul 20 '20 at 01:51