The title is a little wordy. I will try to explain what I want to solve--at this point I am quite lost so I really am looking for the best approach on how to solve this problem. I have a dataframe with data that looks something like this:
id_not_unique | datetime | binary_var |
---|---|---|
111111111 | 2020-08-26 15:44:58 | 1 |
111111111 | 2020-08-28 15:33:45 | 1 |
222222222 | 2020-07-12 11:21:09 | 1 |
222222222 | 2019-04-21 14:22:42 | 0 |
I would like to create another column, let's call it 'identifier', that essentially, grouped by id_not_unique, gives a unique identifier based on binary_var. So for each unique id_not_unique, I would like datetime to be ordered in ascending order--this I can do. But then, for some pattern of binary_var being like 1, or [0, 1], the 'identifier' value would be unique. The resulting dataset should look something like:
id_not_unique | datetime | binary_var | identifier |
---|---|---|---|
111111111 | 2020-08-26 15:44:58 | 1 | x_1 |
111111111 | 2020-08-28 15:33:45 | 1 | x_2 |
222222222 | 2019-04-21 14:22:42 | 0 | x_3 |
222222222 | 2020-07-12 11:21:09 | 1 | x_3 |
The binary_var sequence could be any of [1], [0, 1], [0, 0, 1], essentially any number of 0's followed by a 1. Currently, I am grouping the dataframe by id_not_unique and then after ordering datetime, I am iterating through binary_var and checking what value it is and what the value it is at index-1 to determine if the identifier should be new or the same as the identifier at index-1. But this is slow and my rows are at around 3 million. So is there another better method to approach this?