
I have a CSV file that looks like this:

Timestamp  Status
1501       Normal
1501       Normal
1502       Delay
1503       Received
1504       Normal
1504       Delay
1505       Received
1506       Received
1507       Delay
1507       Received

I want to add a new "Notif" column to the dataframe that acts as a counter and increments every time a "Received" value appears in the "Status" column. I want the output to look like this:

Timestamp  Status     Notif
1501       Normal     N0
1501       Normal     N0
1502       Delay      N0
1503       Received   N1
1504       Normal     N1
1504       Delay      N1
1505       Received   N2
1506       Received   N3
1507       Delay      N3
1507       Received   N4

I tried searching for a solution and various sources suggest using the arcpy package, but I want to make this work without it, as PyCharm doesn't seem to support arcpy.

I also tried using numpy as a conditional operator, roughly along the lines of the sketch below, but that doesn't seem to work.
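A minimal sketch of what I mean (not my exact code); np.where only maps each row to a fixed value, so it can't keep a running count:

import numpy as np

# element-wise replacement only; every "Received" row gets the same label
df['Notif'] = np.where(df['Status'] == 'Received', 'N1', 'N0')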

2 Answers


Iterating over the rows with df.iterrows, you can achieve the following:

df['Notif'] = None
counter = 0
for idx, row in df.iterrows():
    # increment the counter every time a "Received" status is seen
    if row['Status'] == "Received":
        counter += 1
    df.at[idx, 'Notif'] = "N" + str(counter)

print(df)

Output

+----+------------+-----------+-------+
|    | Timestamp  |  Status   | Notif |
+----+------------+-----------+-------+
| 0  |      1501  | Normal    | N0    |
| 1  |      1501  | Normal    | N0    |
| 2  |      1502  | Delay     | N0    |
| 3  |      1503  | Received  | N1    |
| 4  |      1504  | Normal    | N1    |
| 5  |      1504  | Delay     | N1    |
| 6  |      1505  | Received  | N2    |
| 7  |      1506  | Received  | N3    |
| 8  |      1507  | Delay     | N3    |
| 9  |      1507  | Received  | N4    |
+----+------------+-----------+-------+
Sebastien D
[Recommending the use of iterrows() for any problem is bad advice.](https://stackoverflow.com/a/55557758) – cs95 Jun 22 '19 at 16:39

iterrows is always your last resort; it's not vectorized and is very slow. cumsum() is the perfect tool for your problem: build a boolean mask that is True on the rows where Status is "Received" and take its cumulative sum. E.g.:

import pandas as pd

d = {'Timestamp': ['1501', '1501', '1502', '1503', '1504', '1504', '1505', '1506', '1507', '1507'],
     'Status': ['Normal', 'Normal', 'Delay', 'Received', 'Normal', 'Delay', 'Received', 'Received', 'Delay', 'Received']}
df = pd.DataFrame(d)

# True on every "Received" row; cumsum turns that into a running count
received = df['Status'].eq('Received')
df['Notif'] = 'N' + received.cumsum().astype(str)
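A quick check of just the new column (result shown as a comment):

print(df['Notif'].tolist())
# ['N0', 'N0', 'N0', 'N1', 'N1', 'N1', 'N2', 'N3', 'N3', 'N4']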
Autumn