0

I have a DF with data like:

Identificator Date Status
ID1 2021-05-02 19:55:43 OK
ID2 2021-05-02 19:48:01 FAILED
ID3 2021-05-02 19:47:53 OK
ID1 2021-05-03 19:55:43 FAILED
ID2 2021-05-03 20:48:01 OK
ID3 2021-05-03 19:47:53 OK
ID1 2021-05-04 19:55:43 FAILED
ID2 2021-05-04 19:48:01 OK
ID3 2021-05-04 19:47:53 OK

The lines are actually thousands and the IDs are hundreds, but this is not relevant. Let's assume that they are 1 event (line) per ID per day. I want to be able to take only X amount of days back - for example last 3 days and to create grid with number of days back as X axis, IDs as Y axis. Cells are days in a row without success:

X 3 2 1
ID1 0 1 2
ID2 1 0 0
ID3 0 0 0

ID1 on 3 days back is 0, because there is a success on that day. But the same ID has 2 failures after that, so day 1 (the most recent day) is 2 in the cell for this ID.

The idea is to have a Dataframe ready for some kind of heatmap visualization on a later stage, but even in colored xlsx to have visual sense - where the lack of success is more severe, and where is OK

If we imagine ID4 with longer history:

OK,FAIL,OK,OK,FAIL,FAIL,OK,FAIL,FAIL,FAIL,OK

the row should look like

0,1,0,0,1,2,0,1,2,3,0

It's just like a counter of failures in a row which resets to 0 with every OK.

Any help is greatly appreciated

Yavor I
  • 77
  • 1
  • 6
  • I don't understand why ID1 should have a 2 in column 1. I get that 3 days back there were zero failures, and so zero. 2 days back had 1 failure, and 1 day back had 1 failure. If you had some sort of cumulative count, I could understand column 2 having a 2, because there is cumulatively 2 failures going backwards. But if day 1 is cumulative, then I would except ID2 to have `1` in columns 2 and 1, since they would also be counting that failure. – scotscotmcc Jun 04 '21 at 21:26
  • Sorry if I wasn't clear. Yes - the cells are cumulative count of days without success. ID1 2nd of may is finishing OK. That's why column "3" (2.MAY) is 0. Next day is 3rd of may but the day finish with failure. Because this is the 1st day with failure the cell under column "2" (3.MAY) is 1. Column "1" for ID1 is 2, because this is the 2nd day in a row with a failure for this ID. – Yavor I Jun 04 '21 at 21:51
  • 1
    ahh, days-in-a-row is what you are looking for. that makes sense. – scotscotmcc Jun 04 '21 at 21:55
  • If we imagine ID4 with longer history OK,FAIL,OK,OK,FAIL,FAIL,OK,FAIL,FAIL,FAIL,OK the row should look like 0,1,0,0,1,2,0,1,2,3,0 . it's just like a counter of failures in a row which resets to 0 with every OK. – Yavor I Jun 04 '21 at 21:56
  • @YavorI Are you still looking for the answer to the problem? or got the solution? – Ank Jun 10 '21 at 15:27

1 Answers1

0

Create a dataframe with all data we need. The trick is to replace words "FAILED" and "OK" respectively by 1 and NaN. NaN is ignored when you do a cumulative sum.

df1 = pd.DataFrame({"Identificator": df["Identificator"],
                    "Days": df.groupby(["Identificator"]).cumcount(ascending=False)+1,
                    "Status": df["Status"].replace({"FAILED": 1, "OK": np.NaN})})
>>> df1
  Identificator  Days  Status
0           ID1     3     NaN
1           ID2     3     1.0
2           ID3     3     NaN
3           ID1     2     1.0
4           ID2     2     NaN
5           ID3     2     NaN
6           ID1     1     1.0
7           ID2     1     NaN
8           ID3     1     NaN
df1 = df1.pivot(index="Identificator", columns="Days", values="Status") \
         .iloc[:, ::-1] \
         .cumsum(axis="columns") \
         .fillna(0) \
         .astype(int) \
         .rename_axis(index=None, columns=None)

# .iloc[:, ::-1] to reverse order of columns
# cumsum only on FAILED (NaN is ignored)
# set 0 for OK (NaN)
>>> df1
     3  2  1
ID1  0  1  2
ID2  1  0  0
ID3  0  0  0
Corralien
  • 109,409
  • 8
  • 28
  • 52
  • This is almost what I need. I just need the count to reset on each OK. The idea is to count failures in a row, not all failures in the line. If you have line like OK,FAIL,OK,OK,FAIL,FAIL,OK,FAIL,FAIL,FAIL,OK the result to be 0,1,0,0,1,2,0,1,2,3,0 – Yavor I Jun 05 '21 at 14:15
  • As far I can understand cumsum is the one which should reset if hit na. I found this topic: https://stackoverflow.com/questions/48953352/cumsum-reset-at-certain-values and https://stackoverflow.com/questions/18196811/cumsum-reset-at-nan but I am not sure how to implement these in my case – Yavor I Jun 05 '21 at 14:27