
Create the output dataframe from the input: for each ID, keep rows up to and including the first occurrence of target == 1 (in other words, drop everything after the first 1 for each ID), keep all the 0s that come before it, and keep every row for IDs that never have a 1, e.g. a0.

Input

import pandas as pd

df = pd.DataFrame({'ID': ['a0', 'a0', 'a0', 'a1', 'a1', 'a1', 'a1', 'a1', 'a2', 'a2', 'a2', 'a2'],
                   'date': ['2019-11-01', '2019-12-01', '2020-01-01',
                            '2019-11-01', '2019-12-01', '2020-01-01', '2020-02-01', '2020-03-01',
                            '2019-11-01', '2019-12-01', '2020-03-01', '2020-04-01'],
                   'target': [0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1]})

Output

ID   date         target
a0   2019-11-01   0
a0   2019-12-01   0
a0   2020-01-01   0
a1   2019-11-01   0
a1   2019-12-01   0
a1   2020-01-01   1
a2   2019-11-01   0
a2   2019-12-01   1
Priyank
  • I don't really get your query but you can use df.where(condition) to hopefully solve it – Aryan Garg Mar 01 '21 at 07:34
  • So: *"for each ID, sort by date then drop all rows after the first occurrence of 'target==1' (if any)"*? – smci Mar 01 '21 at 07:58

3 Answers


Sort the dataframe first.

df.sort_values(['ID', 'date'], inplace=True)

# use cumsum to count how many 1s have occurred so far within each ID
df['tag'] = df['target'] == 1
df['tag'] = df.groupby('ID')['tag'].cumsum()

# shift by one within each group so the row holding the first 1 is still included
df['tag2'] = df.groupby('ID')['tag'].shift(1).fillna(0)
cond = df['tag2'] == 0
df[cond]

result:

   ID        date  target  tag  tag2
0  a0  2019-11-01       0  0.0   0.0
1  a0  2019-12-01       0  0.0   0.0
2  a0  2020-01-01       0  0.0   0.0
3  a1  2019-11-01       0  0.0   0.0
4  a1  2019-12-01       0  0.0   0.0
5  a1  2020-01-01       1  1.0   0.0
8  a2  2019-11-01       0  0.0   0.0
9  a2  2019-12-01       1  1.0   0.0

Full df with the helper columns:

   ID        date  target  tag  tag2
0   a0  2019-11-01       0  0.0   0.0
1   a0  2019-12-01       0  0.0   0.0
2   a0  2020-01-01       0  0.0   0.0
3   a1  2019-11-01       0  0.0   0.0
4   a1  2019-12-01       0  0.0   0.0
5   a1  2020-01-01       1  1.0   0.0
6   a1  2020-02-01       1  2.0   1.0
7   a1  2020-03-01       0  2.0   2.0
8   a2  2019-11-01       0  0.0   0.0
9   a2  2019-12-01       1  1.0   0.0
10  a2  2020-03-01       0  1.0   1.0
11  a2  2020-04-01       1  2.0   1.0
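
For reference, the same cumsum/shift idea can be written more compactly with transform, so no helper columns are left on df (a sketch, assuming pandas >= 0.24 for shift's fill_value):

# True while no 1 has been seen yet in the group; the shift keeps the first 1 itself
mask = df.groupby('ID')['target'].transform(
    lambda s: s.eq(1).cumsum().shift(fill_value=0).eq(0)
)
df[mask]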
Ferris

Nice question. I believe Ferris' answer is probably the most elegant and computationally efficient way to do it. Another intuitive way is to use apply on the grouped dataframe to compute, for each ID in the sorted dataframe, the index of the last row that should be included in the output.

df["ind"]=df.index

upto_id_index = df.groupby("ID").apply(lambda x: x[(x["target"]==1)]["ind"].min() if (x["target"].sum()>0) else x["ind"].max())

df[df.apply(lambda x: x["ind"]<= upto_id_index.loc[x["ID"]], axis=1)]
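
The row-wise apply in the last step can also be replaced with a vectorized lookup, mapping each row's ID to its cutoff (a sketch reusing the upto_id_index computed above):

# map each row's ID to its cutoff index and compare element-wise
df[df["ind"] <= df["ID"].map(upto_id_index)]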
supercooler8

Using np.argmax to get the index of the first 1 works, but how do we keep all the rows of 0s for IDs that never have target = 1? See my previous post with a different dataset (Pandas advanced groupby and filter by date).
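
One way to handle that case (a sketch, assuming the frame is sorted by ID and date): np.argmax returns 0 when no element is True, so guard the all-zero groups explicitly and keep them whole:

import numpy as np

def upto_first_one(g):
    hits = g['target'].to_numpy() == 1
    # np.argmax would return 0 for an all-zero group, so keep that group whole instead
    return g if not hits.any() else g.iloc[:np.argmax(hits) + 1]

df.sort_values(['ID', 'date']).groupby('ID', group_keys=False).apply(upto_first_one)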

Priyank