
I have a large time series df (2.5 million rows) that contains 0 values in a given column, some of which are legitimate. However, if there are repeated consecutive occurrences of zero values, I would like to remove them from my df.

Example: Col. A contains [1,2,3,0,4,5,0,0,0,1,2,3,0,8,8,0,0,0,0,9] I would like to remove the [0,0,0] and [0,0,0,0] from the middle and leave the remaining 0 to make a new df [1,2,3,0,4,5,1,2,3,0,8,8,9].

The length of the zero run that triggers deletion should be a parameter - in this case > 2.

Is there a clever way to do this in pandas?

swifty

2 Answers


It looks like you want to remove a row if it is 0 and either the previous or the next row in the same column is also 0. You can use shift to look at the previous and next values and compare them with the current value as below:

result_df = df[~((df.ColA == 0) & ((df.ColA.shift(1) == 0) | (df.ColA.shift(-1) == 0)))]
print(result_df)

Result:

    ColA
0      1
1      2
2      3
3      0
4      4
5      5
9      1
10     2
11     3
12     0
13     8
14     8
19     9

Update for more than 2 consecutive

Following the example in the link, add a new column to track the length of each consecutive run, then check it when filtering:

# https://stackoverflow.com/a/37934721/5916727
df['consecutive'] = df.ColA.groupby((df.ColA != df.ColA.shift()).cumsum()).transform('size')
df[~((df.consecutive>10) & (df.ColA==0))]
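Wrapping the same idea in a function makes the run-length cutoff a proper parameter, as the question asks. This is a sketch; the function name `drop_zero_runs` and the `threshold` parameter are my own labels, not from the answer:

```python
import pandas as pd

def drop_zero_runs(df, col, threshold):
    """Drop rows where `col` is 0 and the row sits inside a run of
    consecutive zeros longer than `threshold`."""
    # Give each run of identical consecutive values its own group id
    run_id = (df[col] != df[col].shift()).cumsum()
    # Attach the length of its run to every row
    run_len = df.groupby(run_id)[col].transform('size')
    # Keep everything except zeros that belong to a too-long run
    return df[~((df[col] == 0) & (run_len > threshold))]

df = pd.DataFrame({'ColA': [1, 2, 3, 0, 4, 5, 0, 0, 0, 1,
                            2, 3, 0, 8, 8, 0, 0, 0, 0, 9]})
print(drop_zero_runs(df, 'ColA', 2)['ColA'].tolist())
# [1, 2, 3, 0, 4, 5, 1, 2, 3, 0, 8, 8, 9]
```

Setting `threshold` anywhere in the 10-100 range mentioned in the comments works the same way, with no extra shift expressions.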
niraj
  • I should have been clearer - I may want to set this parameter of continuous 0 values in the 10-100 range. – swifty Apr 09 '18 at 04:21
  • If taking your solution I want to check for zero in not only the one value previous&next but let's say the 10 previous & 10 next. If they all match 0, then remove all. I could do this with a 20 shift expression but there may be a less verbose way? – swifty Apr 09 '18 at 04:27
  • Great! `Happy Coding.` – niraj Apr 09 '18 at 05:13

We can build a new grouping column here, then use drop_duplicates. Note that this removes every run of two or more zeros; the run length is not a parameter.

df['New'] = df.A.eq(0).astype(int).diff().ne(0).cumsum()
s = pd.concat([df.loc[df.A.ne(0), :],
               df.loc[df.A.eq(0), :].drop_duplicates(keep=False)]).sort_index()
s
s
Out[190]: 
    A  New
0   1    1
1   2    1
2   3    1
3   0    2
4   4    3
5   5    3
9   1    5
10  2    5
11  3    5
12  0    6
13  8    7
14  8    7
19  9    9

Explanation:

# df.A.eq(0) flags the values equal to 0
# diff().ne(0).cumsum() puts consecutive equal flags into the same group
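To see how the grouping works, a small sketch on a shortened version of the sample column (the variable name `group` is mine):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 0, 4, 5, 0, 0, 0, 1]})
# eq(0) flags zeros; diff().ne(0) marks where the flag changes;
# cumsum() turns those change points into run ids
group = df.A.eq(0).astype(int).diff().ne(0).cumsum()
print(group.tolist())
# [1, 1, 1, 2, 3, 3, 4, 4, 4, 5]
```

The single zero gets its own group (2), so drop_duplicates(keep=False) leaves it alone, while the three zeros share group 4 and are all dropped.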
BENY