I have a pandas DataFrame with a column of long strings. I would like to add a unique identifier column. I need to keep all the rows, but duplicate strings should get the same ID.
I would like to use this new unique identifier later in a merge.
Let's create a df:
import pandas as pd

df = pd.DataFrame({
    'longstrings': ['aaaaaaaa', 'asdfasdf', 'bbbbbbbbb', 'asdfasdf', 'aaaaaaaa'],
    'somevalue': [1, 2, 3, 4, 5]})
Desired output:
  longstrings  somevalue  unique_ID
0    aaaaaaaa          1          0
1    asdfasdf          2          1
2   bbbbbbbbb          3          2
3    asdfasdf          4          1
4    aaaaaaaa          5          0
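For context, the later merge would look roughly like this; other_df and its extra column are made up just to show the shape of the join, and expected stands in for df once it has the unique_ID column from the desired output:

# hypothetical later use of unique_ID (other_df is invented for illustration)
expected = df.assign(unique_ID=[0, 1, 2, 1, 0])
other_df = pd.DataFrame({'unique_ID': [0, 1, 2],   # made-up lookup table keyed on the ID
                         'extra': ['x', 'y', 'z']})
merged = expected.merge(other_df, on='unique_ID', how='left')
print(merged)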
I have tried to use groupby:
grouped = df.groupby('longstrings')
grouped.transform(lambda ???)
I just do not know how to write a suitable lambda function. Does the grouped object expose some kind of group index I could use?
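The closest thing I have found in the GroupBy docs is ngroup, but I am not sure it is the intended tool here. A minimal sketch of what I mean (sort=False because I think I need the groups numbered in order of first appearance):

# sketch: one integer per distinct string, numbered in order of first appearance
df['unique_ID'] = df.groupby('longstrings', sort=False).ngroup()
print(df)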
I also thought about using a hash function on the strings. However, that does not produce handy small numbers. Also, how likely are hash collisions? My strings are sometimes very similar.
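To illustrate the hash idea (just a sketch; this uses Python's built-in hash, which is salted per interpreter run, so the IDs would not even be stable across sessions):

# sketch of the hash idea: yields large, non-sequential integers per string
df['unique_ID'] = df['longstrings'].apply(hash)
print(df)

That is exactly the "not handy small numbers" problem I mean.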