The primary method here will be to use Series.map()
on the 'ID's to assign the new values.
Used for substituting each value in a Series with another value, that may be derived from a function, a dict
or a Series
.
which is exactly what you're looking for.
Here are some options for generating the new IDs:
1. Randomly generated 8-digit integers, as asked
You can first create a map of randomly generated 8-digit integers with each of the unique ID's in the dataframe. Then use Series.map()
on the 'ID's to assign the new values back. I've included a while
loop to ensure that the generated ID's are unique.
import random
original_ids = df['ID'].unique()
while True:
new_ids = {id_: random.randint(10_000_000, 99_999_999) for id_ in original_ids}
if len(set(new_ids.values())) == len(original_ids):
# all the generated id's were unique
break
# otherwise this will repeat until they are
df['ID'] = df['ID'].map(new_ids)
Output:
ID Sales
0 91154173 100
1 27127403 50
2 91154173 70
3 55892778 60
Edit & Warning: The original ids are Chinese characters and they are already length 8. There's definitely more than 10 Chinese characters so with the wrong combination of original IDs, it could become impossible to make unique-enough 8-digit IDs for the new set. Unless you are memory bound, I'd recommend using 16-24 digits. Or even better...
2. Use UUIDs. [IDEAL]
You can still use the "integer" version of the ID instead of hex. This has the added benefit of not needing to check for uniqueness:
import uuid
original_ids = df['ID'].unique()
new_ids = {cid: uuid.uuid4().int for cid in original_ids}
df['ID'] = df['ID'].map(new_ids)
(If you are okay with hex id's, change uuid.uuid4().int
above to uuid.uuid4().hex
.)
Output:
ID Sales
0 10302456644733067873760508402841674050 100
1 99013251285361656191123600060539725783 50
2 10302456644733067873760508402841674050 70
3 112767087159616563475161054356643068804 60
2.B. Smaller numbers from UUIDs
If the ID generated above is too long, you could truncate it, with some minor risk. Here, I'm only using the first 16 hex characters and converting those to an int. You may put that in the uniqueness loop check as done for option 1, above.
import uuid
original_ids = df['ID'].unique()
DIGITS = 16 # number of hex digits of the UUID to use
new_ids = {cid: int(uuid.uuid4().hex[:DIGITS], base=16) for cid in original_ids}
df['ID'] = df['ID'].map(new_ids)
Output:
ID Sales
0 14173925717660158959 100
1 10599965012234224109 50
2 14173925717660158959 70
3 13414338319624454663 60
3. Creating a mapping based on the actual value:
This group of options has these advantages:
- not needing a uniqueness check since it's deterministically based on the original ID and
- So original IDs which were the same will generate the same new ID
- doesn't need a map created in advance
3.A. CRC32
(Higher probability of finding a collision with different IDs, compared to option 2.B. above.)
import zlib
df['ID'] = df['ID'].map(lambda cid: zlib.crc32(bytes(cid, 'utf-8')))
Output:
ID Sales
0 2083453980 100
1 1445801542 50
2 2083453980 70
3 708870156 60
3.B. Python's built-in hash()
of the orignal ID [My preferred approach in this scenario]
- Can be done in one line, no imports needed
- Reasonably secure to not generate collisions for IDs which are different
df['ID'] = df['ID'].map(hash)
Output:
ID Sales
0 4663892623205934004 100
1 1324266143210735079 50
2 4663892623205934004 70
3 6251873913398988390 60
3.C. MD5Sum, or anything from hashlib
Since the IDs are expected to be small (8 chars), even with MD5, the probability of a collision is very low.
import hashlib
DIGITS = 16 # number of hex digits of the hash to use
df['ID'] = df['ID'].str.encode('utf-8').map(lambda x: int(hashlib.md5(x).hexdigest()[:DIGITS], base=16))
Output:
ID Sales
0 17469287633857111608 100
1 4297816388092454656 50
2 17469287633857111608 70
3 11434864915351595420 60