
The title is a little wordy. I will try to explain what I want to solve; at this point I am quite lost, so I am really looking for the best approach to this problem. I have a dataframe with data that looks something like this:

id_not_unique  datetime             binary_var
111111111      2020-08-26 15:44:58  1
111111111      2020-08-28 15:33:45  1
222222222      2020-07-12 11:21:09  1
222222222      2019-04-21 14:22:42  0

I would like to create another column, let's call it 'identifier', that, grouped by id_not_unique, assigns a unique identifier based on binary_var. For each unique id_not_unique, I first sort datetime in ascending order (this I can do). Then, for each run of binary_var values such as [1] or [0, 1], the 'identifier' value should be the same within the run and unique across runs. The resulting dataset should look something like:

id_not_unique  datetime             binary_var  identifier
111111111      2020-08-26 15:44:58  1           x_1
111111111      2020-08-28 15:33:45  1           x_2
222222222      2019-04-21 14:22:42  0           x_3
222222222      2020-07-12 11:21:09  1           x_3

The binary_var sequence could be any of [1], [0, 1], [0, 0, 1] — essentially any number of 0's followed by a 1. Currently, I group the dataframe by id_not_unique, sort by datetime, and then iterate through binary_var, checking each value against the value at index-1 to decide whether the identifier should be new or the same as the one at index-1. But this is slow, and my dataframe has around 3 million rows. Is there a better way to approach this?
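For reference, the per-row check described above (does the previous row in the group end a run with a 1?) can also be expressed with vectorized pandas operations. A minimal sketch, not taken from any answer here, assuming the column names above:

```python
import pandas as pd

df = pd.DataFrame({
    'id_not_unique': ['111111111', '111111111', '222222222', '222222222'],
    'datetime': pd.to_datetime(['2020-08-26 15:44:58', '2020-08-28 15:33:45',
                                '2020-07-12 11:21:09', '2019-04-21 14:22:42']),
    'binary_var': [1, 1, 1, 0],
})

df = df.sort_values(['id_not_unique', 'datetime'])
# A new run starts wherever the previous row in the same group was a 1;
# the first row of each group also starts a run, hence fill_value=1.
starts = df.groupby('id_not_unique')['binary_var'].shift(fill_value=1)
df['identifier'] = 'x_' + starts.cumsum().astype(str)
print(df)
```

On the sample data this yields x_1, x_2, x_3, x_3, matching the desired output above.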

1 Answer


Here's a snippet that will do what you requested.

import pandas as pd
import numpy as np
import random


# Taken from https://stackoverflow.com/a/50668285/1347611
# Draws n random timestamps uniformly between start and end
def pp(start, end, n):
    start_u = start.value // 10**9
    end_u = end.value // 10**9
    return pd.DatetimeIndex(
        (10**9 * np.random.randint(start_u, end_u, n,
                                   dtype=np.int64)).view('M8[ns]'))


def get_unique_ids(n_rows):
    # Generating random data
    start = pd.to_datetime('2015-01-01')
    end = pd.to_datetime('2017-01-01')
    date_list = pp(start, end, n_rows)
    id_list = ['1111111', '2222222', '3333333']
    df = pd.DataFrame()
    df['datetime'] = date_list
    # Taken from https://note.nkmk.me/en/python-random-choice-sample-choices/
    df['id_not_unique'] = random.choices(id_list, k=n_rows)
    df['binary_var'] = random.choices([0, 1], k=n_rows)
    # Sorting the data
    df = df.sort_values(by=['id_not_unique', 'datetime', 'binary_var'])
    # Taken from https://stackoverflow.com/a/45846173/1347611
    # Ensuring the last value of each group's sequence is 1, as you mentioned.
    # Note: .loc selects by label here, so this assumes the index labels are
    # unique; duplicate labels would set extra rows to 1.
    df.loc[df.groupby('id_not_unique')['binary_var'].tail(1).index,
           'binary_var'] = 1
    # Taken from https://stackoverflow.com/a/53116526/1347611
    # Creating a unique value and an identifier per sequence. The column
    # assignment aligns on the index, so each row receives the count of 1's
    # from itself to the end; all rows in one 0...01 run share that value.
    df['binary_var_cumsum'] = df['binary_var'][::-1].cumsum()
    # The value changes only between runs, so a nonzero diff marks the
    # start of a new run, and the cumsum of those marks numbers the runs.
    df['identifier'] = 'x_' + (
        np.sign(np.abs(df['binary_var_cumsum'].diff().fillna(0))).cumsum() + 1
    ).astype(int).astype(str)
    df.index.name = 'index'
    return df


if __name__ == '__main__':
    # Printing to check
    df = get_unique_ids(20)
    df.loc[:, ['id_not_unique', 'datetime', 'binary_var',
               'identifier']]

I generated random data, so the output may differ on every run, but it always follows the specification you described. The output for one of the runs is:

      id_not_unique            datetime  binary_var identifier
index                                                         
9           1111111 2015-01-23 10:48:37           0        x_1
3           1111111 2015-02-06 04:19:00           0        x_1
10          1111111 2015-05-01 14:08:16           0        x_1
8           1111111 2015-08-23 06:05:32           1        x_1
6           1111111 2015-08-23 20:36:21           0        x_2
2           1111111 2016-02-01 22:36:31           1        x_2
19          1111111 2016-04-10 01:49:13           0        x_3
7           1111111 2016-06-26 00:23:26           1        x_3
17          1111111 2016-06-26 01:07:38           0        x_4
18          1111111 2016-06-26 15:26:54           1        x_4
1           2222222 2015-05-21 07:41:51           0        x_5
15          2222222 2015-06-24 05:43:35           1        x_5
14          2222222 2015-09-30 14:01:19           1        x_6
4           2222222 2015-11-14 07:31:39           1        x_7
12          2222222 2016-09-15 09:07:24           0        x_8
11          2222222 2016-12-17 22:11:31           1        x_8
16          3333333 2015-05-03 12:01:33           1        x_9
5           3333333 2015-05-24 10:29:01           1       x_10
13          3333333 2015-09-29 01:34:38           0       x_11
0           3333333 2015-11-21 22:09:08           1       x_11
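The heart of the snippet is the reversed cumulative sum: because column assignment aligns on the index, every row in one 0...01 run ends up with the same count of trailing 1's, and that count changes between runs. A minimal illustration on a single already-sorted group (toy data, not from the answer above):

```python
import numpy as np
import pandas as pd

s = pd.Series([0, 0, 1, 0, 1, 1])      # runs: [0, 0, 1], [0, 1], [1]
rev = s[::-1].cumsum().sort_index()    # count of 1's from each row to the end
# rev is [3, 3, 3, 2, 2, 1]: constant within a run, changing between runs
run_id = (np.sign(np.abs(rev.diff().fillna(0))).cumsum() + 1).astype(int)
print(run_id.tolist())  # [1, 1, 1, 2, 2, 3]
```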

It took 11.1 seconds for 3 million rows. Not sure what kind of speed you are looking for.

%time dflarge = get_unique_ids(3000000)
Wall time: 11.1 s
  • Actually, I'm not sure this is correct. After running it, binary_var becomes 1 everywhere and the identifier is just x_(the row index), so it doesn't actually group anything. – confused_donkey Apr 10 '21 at 06:46
  • If it's not confidential, can you give me a sizeable sample of your data, so I can check? Not sure if I understood your question well then. – pathankhan.salman Apr 10 '21 at 15:52
  • Sorry, the data is from a company. But I think you did understand it; at least judging from the output, it is correct. Looking at the code, though, doesn't this line 'df.loc[df.groupby('id_not_unique')['binary_var'].tail(1).index, 'binary_var'] = 1' set the last value to 1? It is setting everything to 1, so the identifiers are all unique – confused_donkey Apr 11 '21 at 01:44
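A plausible explanation for the behavior reported in the comments, assuming the real data carries duplicate index labels (the generated data has a unique index, so the snippet works there): `.loc` selects every row sharing a label, so the "set the last value to 1" step can touch far more rows than intended. A sketch of the failure mode, with hypothetical toy data:

```python
import pandas as pd

# Three rows that all share the index label 0
df = pd.DataFrame({'id_not_unique': ['a', 'a', 'b'],
                   'binary_var': [0, 0, 0]},
                  index=[0, 0, 0])

tail_idx = df.groupby('id_not_unique')['binary_var'].tail(1).index
df.loc[tail_idx, 'binary_var'] = 1
# Every row becomes 1, not just the last row of each group
print(df['binary_var'].tolist())
```

If this is the cause, calling df.reset_index(drop=True) after the sort should avoid it.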