
I have a pandas DataFrame with a column of long strings. I would like to add a unique identifier column: all rows should be kept, but duplicate strings should get the same ID.

I would like to use this new unique identifier later in a merge.

Let's create a df:

import pandas as pd

df = pd.DataFrame({
    'longstrings': ['aaaaaaaa', 'asdfasdf', 'bbbbbbbbb', 'asdfasdf', 'aaaaaaaa'],
    'somevalue': [1, 2, 3, 4, 5]})

Desired output:

  longstrings  somevalue  unique_ID
0    aaaaaaaa          1          0
1    asdfasdf          2          1
2   bbbbbbbbb          3          2
3    asdfasdf          4          1
4    aaaaaaaa          5          0

I have tried to use groupby:

grouped = df.groupby('longstrings')
grouped.transform(lambda ???)

I just do not know how to write a suitable lambda function. Does grouped have some kind of index?
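For reference, the closest I have come is building the IDs by hand with a plain dict (just a sketch; get_id is a helper name I made up):

```python
import pandas as pd

df = pd.DataFrame({
    'longstrings': ['aaaaaaaa', 'asdfasdf', 'bbbbbbbbb', 'asdfasdf', 'aaaaaaaa'],
    'somevalue': [1, 2, 3, 4, 5]})

ids = {}

def get_id(s):
    # setdefault stores len(ids) the first time a string appears,
    # and returns the stored value on repeats
    return ids.setdefault(s, len(ids))

df['unique_ID'] = df['longstrings'].map(get_id)
# unique_ID is [0, 1, 2, 1, 0]
```

This works, but it feels like there should be a vectorized pandas way to do the same thing.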

I also thought about using a hash function on the strings. However, that does not produce handy small numbers. Also, how likely are hash collisions? My strings are sometimes very similar.

evilolive

1 Answer


Python has a built-in hash function which will do what you want.

df['unique_id'] = df.longstrings.map(hash)
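A self-contained version of this, plus pd.factorize as an alternative that produces the small sequential IDs from the desired output (factorize is standard pandas, but not part of the original answer):

```python
import pandas as pd

df = pd.DataFrame({
    'longstrings': ['aaaaaaaa', 'asdfasdf', 'bbbbbbbbb', 'asdfasdf', 'aaaaaaaa'],
    'somevalue': [1, 2, 3, 4, 5]})

# hash(): duplicate strings get the same value, but the numbers are large
# and, for str, vary between interpreter runs unless PYTHONHASHSEED is fixed
df['unique_id'] = df.longstrings.map(hash)

# pd.factorize gives stable, small, sequential IDs instead;
# it returns (codes, uniques), and the codes are what we want here
df['unique_ID'] = pd.factorize(df.longstrings)[0]
# unique_ID is [0, 1, 2, 1, 0]
```

Both columns are safe to use as merge keys within a single session; factorize has the advantage that the IDs match the desired output exactly.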
Meow