I have a pandas DataFrame with a column of long strings. I would like to add a unique identifier column. I need to keep all the rows, but duplicate strings should get the same ID.
I would like to use this new unique identifier later in a merge.
Let's create a df:
import pandas as pd

df = pd.DataFrame({
    'longstrings': ['aaaaaaaa', 'asdfasdf', 'bbbbbbbbb', 'asdfasdf', 'aaaaaaaa'],
    'somevalue': [1, 2, 3, 4, 5]})
Desired output:
  longstrings  somevalue  unique_ID
0    aaaaaaaa          1          0
1    asdfasdf          2          1
2   bbbbbbbbb          3          2
3    asdfasdf          4          1
4    aaaaaaaa          5          0
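For context, the later merge would look roughly like this; other_df and its extra column are made up just to show the shape of the join, and expected stands in for df once it has the unique_ID column from the desired output:

# hypothetical later use of unique_ID (other_df is invented for illustration)
expected = df.assign(unique_ID=[0, 1, 2, 1, 0])
other_df = pd.DataFrame({'unique_ID': [0, 1, 2],   # made-up lookup table keyed on the ID
                         'extra': ['x', 'y', 'z']})
merged = expected.merge(other_df, on='unique_ID', how='left')
print(merged)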
I have tried to use groupby:
grouped = df.groupby('longstrings')
grouped.transform(lambda ???)
I just do not know how to write a suitable lambda function. Does the grouped object expose some kind of group index I could use?
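The closest thing I have found in the GroupBy docs is ngroup, but I am not sure it is the intended tool here. A minimal sketch of what I mean (sort=False because I think I need the groups numbered in order of first appearance):

# sketch: one integer per distinct string, numbered in order of first appearance
df['unique_ID'] = df.groupby('longstrings', sort=False).ngroup()
print(df)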
I also thought about using a hash function on the strings. However, that does not produce handy small numbers. Also, how likely are hash collisions? My strings are sometimes very similar.
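To illustrate the hash idea (just a sketch; this uses Python's built-in hash, which is salted per interpreter run, so the IDs would not even be stable across sessions):

# sketch of the hash idea: yields large, non-sequential integers per string
df['unique_ID'] = df['longstrings'].apply(hash)
print(df)

That is exactly the "not handy small numbers" problem I mean.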