
I have a CSV file that looks like this:

Timestamp  Status
1501       Normal
1501       Normal
1502       Delay
1503       Received
1504       Normal
1504       Delay
1505       Received
1506       Received
1507       Delay
1507       Received

I want to add a new "Notif" column to the dataframe that acts as a counter and increments every time a "Received" value appears in the "Status" column. I want the output to look like this:

Timestamp  Status     Notif
1501       Normal     N0
1501       Normal     N0
1502       Delay      N0
1503       Received   N1
1504       Normal     N1
1504       Delay      N1
1505       Received   N2
1506       Received   N3
1507       Delay      N3
1507       Received   N4

I tried searching for a solution and various sources suggest using the arcpy package, but I want to make this work without it, as PyCharm doesn't seem to support arcpy.

I also tried using numpy as a conditional operator, roughly along the lines of the sketch below, but that doesn't seem to work.
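A minimal sketch of what I mean (not my exact code); np.where only maps each row to a fixed value, so it can't keep a running count:

import numpy as np

# element-wise replacement only; every "Received" row gets the same label
df['Notif'] = np.where(df['Status'] == 'Received', 'N1', 'N0')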

2 Answers


Iterating over the rows with df.iterrows, you can achieve the following:

df['Notif'] = None
counter = 0
for idx, row in df.iterrows():
    # increment the counter every time a "Received" status is seen
    if row['Status'] == "Received":
        counter += 1
    df.at[idx, 'Notif'] = "N" + str(counter)

print(df)

Output

+----+------------+-----------+-------+
|    | Timestamp  |  Status   | Notif |
+----+------------+-----------+-------+
| 0  |      1501  | Normal    | N0    |
| 1  |      1501  | Normal    | N0    |
| 2  |      1502  | Delay     | N0    |
| 3  |      1503  | Received  | N1    |
| 4  |      1504  | Normal    | N1    |
| 5  |      1504  | Delay     | N1    |
| 6  |      1505  | Received  | N2    |
| 7  |      1506  | Received  | N3    |
| 8  |      1507  | Delay     | N3    |
| 9  |      1507  | Received  | N4    |
+----+------------+-----------+-------+
Sebastien D
[Recommending the use of iterrows() for any problem is bad advice.](https://stackoverflow.com/a/55557758) – cs95 Jun 22 '19 at 16:39

iterrows is always your last resort; it's not vectorized and is very slow. cumsum() is the perfect tool for your problem: build a boolean mask that is True on the rows where Status is "Received" and take its cumulative sum. E.g.:

import pandas as pd

d = {'Timestamp': ['1501', '1501', '1502', '1503', '1504', '1504', '1505', '1506', '1507', '1507'],
     'Status': ['Normal', 'Normal', 'Delay', 'Received', 'Normal', 'Delay', 'Received', 'Received', 'Delay', 'Received']}
df = pd.DataFrame(d)

# True on every "Received" row; cumsum turns that into a running count
received = df['Status'].eq('Received')
df['Notif'] = 'N' + received.cumsum().astype(str)
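A quick check of just the new column (result shown as a comment):

print(df['Notif'].tolist())
# ['N0', 'N0', 'N0', 'N1', 'N1', 'N1', 'N2', 'N3', 'N3', 'N4']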
Autumn