8

I would like to generate an integer-based unique ID for users (in my df).

Let's say I have:

index  first  last    dob
0      peter  jones   20000101
1      john   doe     19870105
2      adam   smith   19441212
3      john   doe     19870105
4      jenny  fast    19640822

I would like to generate an ID column like so:

index  first  last    dob       id
0      peter  jones   20000101  1244821450
1      john   doe     19870105  1742118427
2      adam   smith   19441212  1841181386
3      john   doe     19870105  1742118427
4      jenny  fast    19640822  1687411973

The ID should be 10 digits long and derived from the values of the fields, so rows with identical values (the two john doe rows) get the same ID.

I've looked into hashing, encryption and UUIDs, but can't find much related to this specific non-security use case. It's just about generating an internal identifier.

  • I can't use groupby or category-code style methods, because the IDs must not change if the order of the rows changes.
  • The dataset won't grow beyond 50k rows.
  • It's safe to assume that no two different users share the same first, last and dob.

Feel like I may be tackling this the wrong way as I can't find much literature on it!

Thanks

swifty
  • Does something like: `df.groupby(['first', 'last', 'dob'], sort=False).ngroup().apply('{:010}'.format)` do what you want? – Jon Clements Feb 25 '20 at 11:41
  • You can follow this thread to learn more about hashing https://stackoverflow.com/questions/16008670/how-to-hash-a-string-into-8-digits – Mahendra Singh Feb 25 '20 at 12:11

3 Answers

4

You can try using Python's built-in hash function:

df['id'] = df[['first', 'last']].sum(axis=1).map(hash)

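Please note that the hash is an integer that is usually longer than 10 digits and may be negative. For strings it also changes between Python processes unless PYTHONHASHSEED is fixed, and uniqueness is likely but not guaranteed.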
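
A minimal sketch of a variant that stays the same between runs, assuming it is acceptable to swap the built-in hash for hashlib.sha256 and fold the digest into the 10-digit range. The helper name `stable_id` and the separator character are illustrative choices, not something from the question.

```python
import hashlib

def stable_id(first, last, dob, digits=10):
    """Hypothetical helper: derive a fixed-width integer id from the row values."""
    # Join the fields with a separator that is unlikely to occur in names,
    # so ("ab", "c") and ("a", "bc") do not produce the same key.
    key = "\x1f".join(str(v) for v in (first, last, dob))
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Take the first 8 bytes as an integer, then fold it into
    # [10**(digits-1), 10**digits) so the id always has `digits` digits.
    n = int.from_bytes(digest[:8], "big")
    low = 10 ** (digits - 1)
    return low + n % (9 * low)

# The rows from the question, typed in by hand
rows = [
    ("peter", "jones", 20000101),
    ("john", "doe", 19870105),
    ("adam", "smith", 19441212),
    ("john", "doe", 19870105),
    ("jenny", "fast", 19640822),
]
ids = [stable_id(*row) for row in rows]
```

On the data frame this could be applied row by row, e.g. `df.apply(lambda r: stable_id(r['first'], r['last'], r['dob']), axis=1)`. Identical rows get the same id regardless of row order, but two different rows can still collide, so it is worth checking for duplicates afterwards.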

Mahendra Singh
1

Here's a way of doing it using numpy:

import numpy as np
np.random.seed(1)

# create a list of unique names
names = df[['first', 'last']].agg(' '.join, 1).unique().tolist()

# generate ids
ids = np.random.randint(low=10**9, high=10**10, size=len(names), dtype=np.int64)

# map names to ids
maps = dict(zip(names, ids))

# add new id column
df['id'] = df[['first', 'last']].agg(' '.join, 1).map(maps)

   index  first   last       dob          id
0      0  peter  jones  20000101  9176146523
1      1   john    doe  19870105  8292931172
2      2   adam  smith  19441212  4108641136
3      3   john    doe  19870105  8292931172
4      4  jenny   fast  19640822  6385979058
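
Note that randint draws with replacement, so two different names could in principle receive the same id. A minimal sketch of drawing distinct ids instead, assuming the standard-library random.sample over a range is an acceptable substitute; the names list is typed in by hand to mirror the example.

```python
import random

rng = random.Random(1)  # fixed seed so the draw is repeatable

# unique "first last" strings, in order of first appearance
names = ["peter jones", "john doe", "adam smith", "jenny fast"]

# sample without replacement: every drawn id is distinct and has 10 digits
ids = rng.sample(range(10**9, 10**10), k=len(names))

# map names to ids
maps = dict(zip(names, ids))
```

The ids still depend on the seed and on the order of names, so the mapping would have to be stored if it must survive a reshuffle of the rows.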
YOLO
0

You can apply the function below to a column of your data frame.

def generate_id(s):
    return abs(hash(s)) % (10 ** 10)

df['id'] = df['first'].apply(generate_id)

If you find that some values come out with fewer digits than required, you can do something like this:

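def generate_id(s, size):
    # reduce the hash to at most `size` digits
    val = str(abs(hash(s)) % (10 ** size))
    if len(val) < size:
        # too short: append digits hashed from a prefix of s to fill the gap
        diff = size - len(val)
        val += str(generate_id(s[:diff], diff))
    return int(val)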
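
Because every id is squeezed into 10 digits, two different inputs can end up with the same id. A rough estimate of how likely that is, assuming the hash spreads values uniformly and using the standard birthday approximation; the 50,000 figure is the upper bound mentioned in the question, and the function name is illustrative.

```python
import math

def collision_probability(n, space):
    """Birthday approximation: chance that n uniform draws
    from `space` possible values contain at least one duplicate."""
    return 1.0 - math.exp(-n * (n - 1) / (2 * space))

# 50,000 users hashed into the 10**10 values allowed by the modulus
p = collision_probability(50_000, 10**10)
print(f"collision probability: {p:.2%}")
```

For 50,000 users this comes out at roughly 12%, so it is worth running `df['id'].duplicated().any()` on the distinct users after generating the ids.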
RockStar