I have a dataframe that I need to make unique IDs for. I was previously using the following code:

np.random.seed(1)

names = parent[['givenName', 'familyName']].agg(' '.join, 1).unique().tolist()

ids = np.random.randint(low=1e9, high=1e10, dtype=np.int64, size = len(names))
maps = {k:v for k,v in zip(names, ids)}
parent['sourcedId'] = parent[['givenName', 'familyName']].agg(' '.join, 1).map(maps)

I am running into an issue where I'm getting repeated IDs. I don't know if it's from people having the same name or what. I can't just assign numbers as the names could come in a different order every time. I also have the option to add parent['phone'] for the ID generation, but multiple people could have the same phone number, which is why I haven't been using it for ID generation up to this point. Any help would be appreciated.

Qoi VamBorg
  • Does the unique id have to involve random numbers or can you use a part of the first name, last name and pandas index value? ex: Adam Smith 22 >>> AdSm22 – IamWarmduscher Feb 17 '22 at 17:53
  • Do you want the unique id to be created based on the data in a row? I.e. will you need to be able to recreate the same number given the same data in the row? – nonDucor Feb 17 '22 at 17:55
  • Yes, hence my dilemma with the possibility of repeated names or phone numbers. If I base it on just one of those, I run into the problem of repeated IDs; if I combine them, it's going to be unique, but I have run into other issues doing that. – Qoi VamBorg Feb 17 '22 at 18:07
  • It does not have to be random numbers but I have had the data order change, and I would like to avoid changing the ID number if that happens – Qoi VamBorg Feb 17 '22 at 18:13
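
The requirement in the comments above (the same row data should always give the same ID, whatever order the rows arrive in) could be met by deriving the ID from the row's contents instead of from a random generator. A minimal sketch, assuming the combination of givenName, familyName and phone is unique per person; the `row_id` helper, the `NAMESPACE` value and the sample data are hypothetical and not from the original post:

```python
import uuid

import pandas as pd

# Hypothetical namespace; any fixed UUID works as long as it never changes.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.org")


def row_id(given_name, family_name, phone):
    """Derive a stable ID from the identifying fields of one row."""
    # Normalise so differences in case or surrounding spaces do not change the ID.
    parts = (given_name, family_name, phone)
    key = "|".join(str(part).strip().lower() for part in parts)
    return str(uuid.uuid5(NAMESPACE, key))


# Made-up sample data: two people share a name, two share a phone number.
parent = pd.DataFrame({
    "givenName": ["Ann", "Bob", "Ann"],
    "familyName": ["Lee", "Ray", "Lee"],
    "phone": ["555-0100", "555-0100", "555-0199"],
})

parent["sourcedId"] = [
    row_id(g, f, p)
    for g, f, p in zip(parent["givenName"], parent["familyName"], parent["phone"])
]
```

Because `uuid.uuid5` is a hash of its input, two rows with identical name and phone would still receive the same ID, so this only helps if those fields together really identify a person.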

1 Answer

One way is to use the uuid module built into Python. You can create a new column of unique identifiers with something like the following.

import uuid

df['uuid'] = df.apply(lambda _: uuid.uuid4(), axis=1)

For example, with the following dataframe

import pandas as pd
import random

df = pd.DataFrame({
    'text': [random.choice(["apple","banana","orange"]) for i in range(5)],
    'number': [random.randint(1,10) for i in range(5)]
})

print(df)
     text  number
0   apple       1
1  banana       4
2  orange       7
3  banana       4
4  orange       8

You would get something like the following (the values differ on every run, because uuid4() is random)

import uuid

df['uuid'] = df.apply(lambda _: uuid.uuid4(), axis=1)

print(df)
     text  number                                  uuid
0   apple       1  c46d707c-6903-4464-be7d-efe8fa5e6426
1  banana       4  afbeac71-45d6-4a85-a6a6-04d064397f91
2  orange       7  8ce6e518-6182-41f9-918e-58ccc9cf8d17
3  banana       4  d3d1be90-a010-4cbe-a67c-6593702a6a3d
4  orange       8  5b6a8c94-da25-4456-a3a2-938a74cdec4c
BoomBoxBoy
  • It is my understanding that uuid uses a time component, is there a way to get around that? – Qoi VamBorg Feb 17 '22 at 17:58
  • That's a good question. By using `uuid4()` you are avoiding this. See more info here --> https://en.wikipedia.org/wiki/Universally_unique_identifier – BoomBoxBoy Feb 17 '22 at 18:27
  • Awesome, I have implemented your suggestion, is there a way to make the ID re-creatable? – Qoi VamBorg Feb 17 '22 at 18:58
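  • Yes, you can do that by seeding the random module and building each UUID from its output instead of calling `uuid4()` directly, since `uuid4()` itself is not affected by the random module's seed. See more on that here --> https://stackoverflow.com/questions/41186818/how-to-generate-a-random-uuid-which-is-reproducible-with-a-seed-in-python – BoomBoxBoy Feb 17 '22 at 19:04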
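
A sketch of the seeded approach mentioned in the last comment, assuming only Python's standard `random` and `uuid` modules; `seeded_uuids` is a hypothetical helper name. `uuid.uuid4()` draws from the operating system's randomness and ignores `random.seed`, so here the UUIDs are built from a seeded generator's bits instead:

```python
import random
import uuid


def seeded_uuids(n, seed=1):
    """Return n version-4-style UUIDs that are identical for the same seed."""
    # A private generator avoids disturbing the global random state.
    rng = random.Random(seed)
    # version=4 overwrites the version and variant bits so the result is a
    # well-formed UUID4; the remaining bits come from the seeded generator.
    return [uuid.UUID(int=rng.getrandbits(128), version=4) for _ in range(n)]
```

This could be assigned with `df['uuid'] = seeded_uuids(len(df))`. The IDs are tied to row position, so they would change if the rows arrive in a different order.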