I want to prepare a pd.DataFrame with data related to machine maintenance. The data is a time series. I want to clean my targets (df['entry'] in the example below) so that only the first 2 elements of each pattern are kept (a pattern being a run of consecutive 1s). I have a POC using shift, but it can miss events (the last event in the example below). The DataFrame below contains 4 pattern starts. Any idea how to create a feature that cleans my dataset and keeps only the first n elements of each pattern?

What I have so far:

import pandas as pd

df = pd.DataFrame({'entry':   [0,1,1,1,1,1,0,0,1,1,0,0,0,1,0,1,0],
                   'Expected':[0,1,1,0,0,0,0,0,1,1,0,0,0,1,0,1,0],
                   'comment': ['', 'keep', 'keep', 'drop', 'drop', 'drop', '', '', 'keep', 'keep', '', '', '', 'keep', '', 'How to get that one ?', '']})

# Compare each row with the value two rows earlier
df['shifted'] = df['entry'].shift(2).fillna(0)
def first(entry):
  # True when the row is 1 and the value two rows earlier was 0
  return entry['entry']==1 and entry['shifted']==0
df['calculated'] = df.apply(first, axis=1)
df

Below is what I get from my script. Note that the second-to-last row is calculated incorrectly (the start of a pattern is missed):

entry   Expected    comment                 shifted     calculated
0       0                                   0.0         False
1       1           keep                    0.0         True
1       1           keep                    0.0         True
1       0           drop                    1.0         False
1       0           drop                    1.0         False
1       0           drop                    1.0         False
0       0                                   1.0         False
0       0                                   1.0         False
1       1           keep                    0.0         True
1       1           keep                    0.0         True
0       0                                   1.0         False
0       0                                   1.0         False
0       0                                   0.0         False
1       1           keep                    0.0         True
0       0                                   0.0         False
1       1           How to get that one ?   1.0         False
0       0                                   0.0         False

Comments are welcome.

Laurent R
  • Please paste the expected output. That makes suggesting a solution easier. – moys Aug 24 '19 at 03:46
  • If you are looking to perform a `groupby` and then get the first `n` items from each group then you can [use](https://stackoverflow.com/a/20069379/4057186) `df.groupby(...).head(n)`. In your code you indicated *only keep the first 2 elements of groups*, however it does not appear that you are using a `groupby`. If you could please clarify (i) what you mean by keeping only the first 2 elements of groups and (ii) whether your current code is giving you your expected output, then that would help to better understand the question. Thanks. – edesz Aug 24 '19 at 04:14
  • What @edesz said ... I was just about to provide that as an answer ... but you have not given us the criteria for what a "group" is. – Joran Beasley Aug 24 '19 at 04:20
  • "Group" is not the right word; it's more like patterns in a time series. I will try to clarify. – Laurent R Aug 24 '19 at 04:20

2 Answers

You can use groupby, cumsum and head:

df['Expected_1'] = df.groupby(df['entry'].diff().eq(1).cumsum())\
                     .head(2)['entry'].reindex(df.index, fill_value=0)

Output:

    Expected                comment  entry  Expected_1
0          0                             0           0
1          1                   keep      1           1
2          1                   keep      1           1
3          0                   drop      1           0
4          0                   drop      1           0
5          0                   drop      1           0
6          0                             0           0
7          0                             0           0
8          1                   keep      1           1
9          1                   keep      1           1
10         0                             0           0
11         0                             0           0
12         0                             0           0
13         1                   keep      1           1
14         0                             0           0
15         1  How to get that one ?      1           1
16         0                             0           0
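
The same idea can be wrapped in a small helper so that the number of rows kept per pattern becomes a parameter. This is only a sketch: the function name keep_first_n and its n argument are my own naming, not part of pandas, and it assumes entry holds only 0s and 1s.

```python
import pandas as pd


def keep_first_n(entry: pd.Series, n: int = 2) -> pd.Series:
    """Keep the first n rows of each pattern and set the rest to 0.

    Hypothetical helper; assumes `entry` contains only 0s and 1s.
    """
    # A pattern starts where the value rises from 0 to 1, i.e. diff() == 1.
    starts = entry.diff().eq(1)
    # The running total of starts gives one integer label per pattern.
    group_id = starts.cumsum()
    # head(n) keeps the first n rows of every group; the rows it drops are
    # restored by reindex and filled with 0.
    return entry.groupby(group_id).head(n).reindex(entry.index, fill_value=0)


entry = pd.Series([0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0])
result = keep_first_n(entry, n=2)
print(result.tolist())
```

With n=2 this should reproduce the Expected column from the question, including the single-row pattern near the end that the shift(2) approach misses.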
Scott Boston

Since you want to keep the rows where 'entry' and 'Expected' are the same, will this work for you?

df1 = df[df['entry'] == df['Expected']]

The result is:

entry   Expected    comment
0   0   
1   1   keep
1   1   keep
0   0   
0   0   
1   1   keep
1   1   keep
0   0   
0   0   
0   0   
1   1   keep
0   0   
1   1   How to get that one ?
0   0   

If you also want to remove the rows where entry is 0, you can use the code below:

mask = df['entry'].ne(0)
df2 = df[mask & (df['entry'] == df['Expected'])]

The result is:

entry   Expected    comment
1   1   keep
1   1   keep
1   1   keep
1   1   keep
1   1   keep
1   1   How to get that one ?
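
If the Expected column is not available yet (it is the target being built), the same rows can probably be selected from entry alone. A possible sketch, assuming entry contains only 0s and 1s; the names group_id, position and kept are illustrative:

```python
import pandas as pd

df = pd.DataFrame({'entry': [0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0]})
n = 2  # number of rows to keep per pattern (assumed parameter)

# Label each pattern: the counter increases at every 0 -> 1 transition.
group_id = df['entry'].diff().eq(1).cumsum()
# Position of each row within its pattern (0, 1, 2, ...).
position = df.groupby(group_id).cumcount()
# Keep rows that are 1 and among the first n rows of their pattern.
kept = df[df['entry'].eq(1) & position.lt(n)]
print(kept.index.tolist())
```

On the example data this selects the same six rows as df2 above, without reading the Expected column.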
moys