I'm working with a large dataset of all police stops in my city since 2014. It has millions of rows, but a single stop can span multiple rows: if the police stopped a group of 4 people, the stop appears in the database as 4 separate rows even though it's all the same stop. I'm looking to create a new column, orderInStop, that numbers the people within each stop sequentially: the first person caught up in the stop would have a value of 1, the second person a value of 2, and so on.
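To make the goal concrete, here's a toy example with made-up values (the real dataset uses the Date_Time_Occur and Location columns shown in my code below):

import pandas as pd

# made-up sample: one stop involving 3 people, then a stop involving 1 person
df = pd.DataFrame({
    "Date_Time_Occur": ["2014-01-05 21:30"] * 3 + ["2014-01-06 09:15"],
    "Location":        ["Main & 1st"] * 3 + ["Oak & 5th"],
})

# desired result:
#     Date_Time_Occur    Location  orderInStop
# 0  2014-01-05 21:30  Main & 1st            1
# 1  2014-01-05 21:30  Main & 1st            2
# 2  2014-01-05 21:30  Main & 1st            3
# 3  2014-01-06 09:15   Oak & 5th            1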
To do so, I have used the groupby() function to group all rows that match on time and location, which indicates that the rows are all part of the same stop. I can manage to create a new column that holds the TOTAL count of people in the stop (so if there were 4 rows with the same time and location, all four rows get a value of 4). But I need the first row in each group to have a value of 1, the second a value of 2, the third 3, and the fourth 4.
Below is my attempt at iterating through each group to count its rows sequentially, but the code doesn't quite work: each assignment overwrites the entire column rather than a single row within the group. Any help tweaking this code would be much appreciated!
Note: I also tried using logical operators in a for loop, essentially asking IF the time and location values match between the current and previous rows, but I ran into too many 'The truth value of a Series is ambiguous' errors, so instead I'm trying groupby().
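For context, that earlier loop looked roughly like this (reconstructed from memory, so treat the details as approximate):

for i in range(1, len(df)):
    # comparing two columns yields a whole boolean Series, not a single
    # True/False, which is what raises the ambiguous-truth-value error
    if df["Date_Time_Occur"] == df["Date_Time_Occur"].shift():
        ...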
Attempt that creates a total count rather than sequential count:
df['order2'] = df.groupby(by=["Date_Time_Occur", "Location"])['orderInStop'].transform('count')
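(On rows like the toy example above, this fills order2 with each group's size repeated for every row in it, e.g. 3, 3, 3, 1, rather than a running count within the group.)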
Attempt that fails, to iterate through each row in each group:
df['order3'] = 1
grp = df.groupby(by=["Date_Time_Occur", "Location"])
for name, groups in grp:
    count = 1
    for row in groups:        # iterates over the group's COLUMN NAMES, not its rows
        df['order3'] = count  # assigns count to the entire column, not one row
        count = count + 1
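For reference, the kind of per-group assignment I think I need looks something like this (a minimal sketch on my part, assuming .loc with the group's index labels targets just those rows; there may well be a more idiomatic built-in I'm missing):

df['order3'] = 1
grp = df.groupby(by=["Date_Time_Occur", "Location"])
for name, group in grp:
    # number only this group's rows 1..n via their original index labels
    df.loc[group.index, 'order3'] = range(1, len(group) + 1)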