
I have a dataframe as below. Initially it has three columns ('date', 'time', 'flag'). I want to add a column based on the flag and the date: once flag=1 appears on a given day, the target for the rest of that day is 1; otherwise the target is 0.

         date      time  flag  target
0   2017/4/10  10:00:00     0       0
1   2017/4/10  11:00:00     1       1
2   2017/4/10  12:00:00     0       1
3   2017/4/10  13:00:00     0       1
4   2017/4/10  14:00:00     0       1
5   2017/4/11  10:00:00     1       1
6   2017/4/11  11:00:00     0       1
7   2017/4/11  12:00:00     1       1
8   2017/4/11  13:00:00     1       1
9   2017/4/11  14:00:00     0       1
10  2017/4/12  10:00:00     0       0
11  2017/4/12  11:00:00     0       0
12  2017/4/12  12:00:00     0       0
13  2017/4/12  13:00:00     0       0
14  2017/4/12  14:00:00     0       0
15  2017/4/13  10:00:00     0       0
16  2017/4/13  11:00:00     1       1
17  2017/4/13  12:00:00     0       1
18  2017/4/13  13:00:00     1       1
19  2017/4/13  14:00:00     0       1
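
For reference, here is a minimal sketch that reproduces the sample data above (column names as in the table; target is the desired output, so it is not built here):

import pandas as pd

df = pd.DataFrame({
    'date': ['2017/4/10'] * 5 + ['2017/4/11'] * 5 + ['2017/4/12'] * 5 + ['2017/4/13'] * 5,
    'time': ['10:00:00', '11:00:00', '12:00:00', '13:00:00', '14:00:00'] * 4,
    'flag': [0, 1, 0, 0, 0,
             1, 0, 1, 1, 0,
             0, 0, 0, 0, 0,
             0, 1, 0, 1, 0],
})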
Hong
  • Have you read: https://stackoverflow.com/questions/12555323/adding-new-column-to-existing-dataframe-in-python-pandas – knh190 Jan 10 '19 at 09:21
  • @knh190 I think you should read the question better. It isn't as simple as what you linked. – ycx Jan 10 '19 at 09:24

2 Answers


Use GroupBy.cumsum to get a cumulative sum of the flag values per date, compare it with 0, and finally cast the boolean mask to integer:

# once a 1 appears on a date, the cumulative sum stays positive for the rest of that day
df['new'] = (df.groupby('date')['flag'].cumsum() > 0).astype(int)
print (df)
         date      time  flag  target  new
0   2017/4/10  10:00:00     0       0    0
1   2017/4/10  11:00:00     1       1    1
2   2017/4/10  12:00:00     0       1    1
3   2017/4/10  13:00:00     0       1    1
4   2017/4/10  14:00:00     0       1    1
5   2017/4/11  10:00:00     1       1    1
6   2017/4/11  11:00:00     0       1    1
7   2017/4/11  12:00:00     1       1    1
8   2017/4/11  13:00:00     1       1    1
9   2017/4/11  14:00:00     0       1    1
10  2017/4/12  10:00:00     0       0    0
11  2017/4/12  11:00:00     0       0    0
12  2017/4/12  12:00:00     0       0    0
13  2017/4/12  13:00:00     0       0    0
14  2017/4/12  14:00:00     0       0    0
15  2017/4/13  10:00:00     0       0    0
16  2017/4/13  11:00:00     1       1    1
17  2017/4/13  12:00:00     0       1    1
18  2017/4/13  13:00:00     1       1    1
19  2017/4/13  14:00:00     0       1    1
jezrael
  • Simply fantastic use of `True` to `1` implementation of Python's inherent interpreted design decision and almost casual usage of `groupby`'s inherent ascending sort. The one possible part where it might screw up is if the time was not already sorted, however. :) – ycx Jan 10 '19 at 09:37
  • @ycx - Thank you :) – jezrael Jan 10 '19 at 09:37
  • Nice one @jezrael :-) Another option is to use `gt(0)` here, great solution – yatu Jan 10 '19 at 10:08
  • @yatu - Thank you. – jezrael Jan 10 '19 at 10:08
  • Is that a typo or is the value for df['new'] at row index 0 really 0? Edit: Okay, it's really 0 due to cumsum. I guess this chink in the armor is still easy to live with. – kerwei Jan 11 '19 at 10:27
  • @jezrael Yeah, just need to add in one more line to backfill the first row based on the value in the second row if the dates are the same – kerwei Jan 11 '19 at 10:29
  • @jezrael Thanks. But I have another problem. If the flag is 1 (its first appearance in a day), then target2 is 1 and the rest of the day is 0. How can I write that code? Thanks in advance. – Hong Jan 16 '19 at 08:42
  • @Hong - So e.g. if the first line of data changes from `0 2017/4/10 10:00:00 0` to `0 2017/4/10 10:00:00 1`, then the output is not `1` for the `2017/4/10` day? Not sure I understand. – jezrael Jan 16 '19 at 08:45
  • @jezrael then the output for 2017/4/10 10:00 is 1, and the rest of 2017/4/10 (11:00, 12:00, ...) is 0. – Hong Jan 16 '19 at 09:46
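
For the follow-up in the comments (marking only the row where flag first becomes 1 on each day), a minimal sketch along the same groupby/cumsum lines - target2 is the hypothetical column name taken from the comment:

# 1 only where flag is 1 and it is the first 1 of that date; 0 everywhere else
df['target2'] = (df['flag'].eq(1) & df.groupby('date')['flag'].cumsum().eq(1)).astype(int)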

Okay, I know we've already found a solution here, but just to satisfy the nerd in me, here's an answer (not elegant, given how long it is) that avoids that nagging first-row flaw:

pd.merge(df, df.groupby('date')['flag'].any().astype(int).to_frame().reset_index(), left_on='date', right_on='date')

The approach remains the same as @jezrael's - the groupby is key here. Instead of cumsum, which leads to the first-row flaw, any() fits really well into this solution. The only drawback is that it produces a Series, which we then need to coerce back into a DataFrame and reset the index before joining on the date key.
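
For what it's worth, a shorter variant of the same per-day idea (using transform with max instead of any, assuming flag only takes the values 0 and 1) writes the result back in place and avoids the merge entirely:

# per-day max of a 0/1 flag is 1 exactly when any flag on that date is 1
df['new'] = df.groupby('date')['flag'].transform('max')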

kerwei