Pandas - Generate Unique ID based on row values

Question

I would like to generate an integer-based unique ID for users (in my df).

Let's say I have:

index  first  last    dob
0      peter  jones   20000101
1      john   doe     19870105
2      adam   smith   19441212
3      john   doe     19870105
4      jenny  fast    19640822

I would like to generate an ID column like so:

index  first  last    dob       id
0      peter  jones   20000101  1244821450
1      john   doe     19870105  1742118427
2      adam   smith   19441212  1841181386
3      john   doe     19870105  1742118427
4      jenny  fast    19640822  1687411973

The ID should be 10 digits long and based on the values of the fields, so rows with identical values (the two john doe rows) get the same ID.

I've looked into hashing, encrypting, and UUIDs, but can't find much related to this specific non-security use case. It's just about generating an internal identifier.

I can't use groupby/cat code type methods, in case the order of the rows changes.
The dataset won't grow beyond 50k rows.
It's safe to assume that two different users won't share the same first, last and dob.

Feel like I may be tackling this the wrong way as I can't find much literature on it!

Thanks

Does something like: `df.groupby(['first', 'last', 'dob'], sort=False).ngroup().apply('{:010}'.format)` do what you want? — Jon Clements, Feb 25 '20 at 11:41
You can follow this thread to learn more about hashing https://stackoverflow.com/questions/16008670/how-to-hash-a-string-into-8-digits — Mahendra Singh, Feb 25 '20 at 12:11

score 4 · Answer 1 · answered Feb 25 '20 at 12:01

You can try using Python's built-in hash function:

df['id'] = df[['first', 'last']].sum(axis=1).map(hash)

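Please note that the hash can be longer than 10 digits and may be negative. Rows with identical values get the same integer, but Python salts string hashes per process, so the values change between runs unless PYTHONHASHSEED is fixed.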
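If the IDs need to stay the same between runs, one option is a hash that does not depend on the interpreter's salt. The sketch below uses `pd.util.hash_pandas_object`, which returns one unsigned 64-bit hash per row; the sample frame is copied from the question, and including `dob` in the hashed columns is my assumption about what should identify a user. The result is longer than 10 digits, and I would only rely on it being reproducible as long as the pandas version stays the same.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "first": ["peter", "john", "adam", "john", "jenny"],
    "last": ["jones", "doe", "smith", "doe", "fast"],
    "dob": [20000101, 19870105, 19441212, 19870105, 19640822],
})

cols = ["first", "last", "dob"]

# One uint64 hash per row, computed from the selected columns only.
# index=False leaves the row index out, so the position of a row
# in the frame does not affect its hash.
df["id"] = pd.util.hash_pandas_object(df[cols], index=False)

print(df)
```

Because only the row values go into the hash, reordering the rows should not change any ID.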

Mahendra Singh


YOLO · Answer 2 · 2020-02-25T12:52:26.937

Here's a way of doing it using numpy:

import numpy as np
np.random.seed(1)

# create a list of unique names
names = df[['first', 'last']].agg(' '.join, axis=1).unique().tolist()

# generate ids (random draws, so they are not guaranteed to be distinct)
ids = np.random.randint(low=10**9, high=10**10, size=len(names), dtype=np.int64)

# map names to ids
maps = dict(zip(names, ids))

# add new id column
df['id'] = df[['first', 'last']].agg(' '.join, axis=1).map(maps)

   index  first   last       dob          id
0      0  peter  jones  20000101  9176146523
1      1   john    doe  19870105  8292931172
2      2   adam  smith  19441212  4108641136
3      3   john    doe  19870105  8292931172
4      4  jenny   fast  19640822  6385979058
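Independent random draws can repeat, so two different names could end up with the same ID. A minimal sketch of one way to avoid that is to keep drawing until enough distinct values have been collected. The helper name `draw_unique_ids` is hypothetical, and it uses NumPy's `default_rng` generator rather than the legacy `np.random.seed` shown above.

```python
import numpy as np

def draw_unique_ids(n, seed=1):
    """Draw n distinct 10-digit integers, reproducibly for a given seed."""
    rng = np.random.default_rng(seed)
    ids = {}  # dict keys keep the draw order and drop duplicates
    while len(ids) < n:
        # Draw only as many values as are still missing.
        batch = rng.integers(10**9, 10**10, size=n - len(ids), dtype=np.int64)
        ids.update(dict.fromkeys(batch.tolist()))
    return list(ids)

ids = draw_unique_ids(5)
print(ids)
```

The IDs are still assigned in the order the names are first seen, so the name-to-ID mapping would have to be stored and reused if the rows can be reordered or new users are added later.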

would you need to use `seed` to make the generation consistent? — Umar.H, Feb 25 '20 at 12:05

RockStar · Answer 3 · 2020-02-25T12:31:30.230

0

You can apply the function below to your data frame column.

def generate_id(s):
    return abs(hash(s)) % (10 ** 10)

df['id'] = df['first'].apply(generate_id)

If you find that some values do not have the exact number of digits, you can do something like this:

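def generate_id(s, size):
    # keep at most the last `size` digits of the hash
    val = str(abs(hash(s)) % (10 ** size))
    if len(val) < size:
        # too short: pad with digits derived from a prefix of s
        diff = size - len(val)
        val = val + str(generate_id(s[:diff], diff))
    return int(val)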
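An alternative to padding is to map the hash straight into the range of numbers that have exactly `size` digits. The following is only a sketch: the function name `fixed_width_id` is hypothetical, and it swaps the built-in `hash` for `hashlib.sha256` so the result does not change between Python processes.

```python
import hashlib

def fixed_width_id(s, size=10):
    """Map a string to an integer with exactly `size` digits."""
    digest = hashlib.sha256(s.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    h = int.from_bytes(digest[:8], "big")
    low = 10 ** (size - 1)       # smallest number with `size` digits
    return low + h % (9 * low)   # result lies in [low, 10 * low)

for name in ["Sarah Wood", "Tom Almond"]:
    print(name, fixed_width_id(name))
```

Different strings can still land on the same ID, so it is worth checking the resulting column for duplicates.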

edited Feb 25 '20 at 12:31

answered Feb 25 '20 at 11:41

RockStar


This is pretty nice, though I'm getting some 9-digit IDs mixed in – swifty Feb 25 '20 at 11:57
Can you share a couple of strings for which 9 digits were generated? – RockStar Feb 25 '20 at 12:01
`Sarah Wood`, `Tom Almond` – swifty Feb 25 '20 at 12:14
I have tested in multiple environments, and it generates 10 digits only. Check this link - https://onlinegdb.com/ByUhl5z48 – RockStar Feb 25 '20 at 12:19
@swifty I've added some code; you can use it, test it out, and modify it. – RockStar Feb 25 '20 at 12:32
This is bad code, but it should demonstrate the problem - https://onlinegdb.com/rJ6o_qGNU – swifty Feb 25 '20 at 12:53
@swifty I tested your code with the updated function in my answer, and it works properly. Check - https://onlinegdb.com/B1tucqfN8 – RockStar Feb 25 '20 at 13:00
@swifty Did it help? – RockStar Feb 25 '20 at 13:33
