I'm wondering whether there is a better way to do this.

I currently have a working function that generates a 5-character alphanumeric key code from a given index number.

The problem is that it is far too slow. I expect at least 30 million records, and a test run for just one million records takes forever.

Can anyone suggest how to make this code cleaner and faster? Thanks in advance.

import time


def generate_unique(index_id):
    BASE = 35;                                          # zero-based
    base36 = ['0','1','2','3','4','5','6','7','8','9',
              'a','b','c','d','e','f','g','h','i','j','k',
              'm','n','o','p','q','r','s','t','u','v','w','x','y','z']

    idx = [0, 0, 0, 0, 0]

    for i in range(0, index_id - 1):
        idx[4] = idx[4] + 1
        if idx[4] == BASE:
            idx[4] = 0
            idx[3] = idx[3] + 1
            if idx[3] == BASE:
                idx[3] = 0
                idx[2] = idx[2] + 1
                if idx[2] == BASE:
                    idx[2] = 0
                    idx[1] = idx[1]+1
                    if idx[1] == BASE:
                        idx[1] = 0
                        idx[0] = idx[0] + 1

    return base36[idx[0]] + base36[idx[1]] + base36[idx[2]] + base36[idx[3]] + base36[idx[4]]


t1 = time.process_time()
for i in range(1, 1000000):
    generate_unique(i)
t2 = time.process_time()
print(f"Process completed successfully in {t2 - t1} seconds.")
BLitE.exe
  • Does it need to be 5 digits? Does this help? https://stackoverflow.com/questions/1210458/how-can-i-generate-a-unique-id-in-python – Axe319 Jun 15 '20 at 16:59
  • Hi @Axe319, yes. All I want is a unique key that can accommodate at least 100 million records, so I think a 5-character alphanumeric base-36 code is sufficient for that. – BLitE.exe Jun 15 '20 at 17:04
  • The reason I ask whether it needs to be 5 digits is that something like `[str(uuid.uuid4()) for _ in range(1000000)]` takes around 4 seconds to run on my machine and gives you virtually infinite room for growth. The only caveat is that it's a 36-character string. – Axe319 Jun 15 '20 at 17:12
  • I think as long as it produces a 5 digit unique code it is alright. – BLitE.exe Jun 15 '20 at 17:17
  • What's the use case? Are you using them as a unique key for your data? – Umar.H Jun 15 '20 at 17:18
  • No, I'm just going to concatenate this unique code to a string, in this case, let's say, a company name. I just need to be able to treat these company names as unique in the future. – BLitE.exe Jun 15 '20 at 17:39
  • "The goal is to use a column which has a unique name. If you have another way to do that then that will be fine but it will be used for appending data back to my script's database and the unique name may not have any non Alphanumeric characters. If I used only business name then there will be a duplicate issue when doing other data sets later on that have not been run through my script up to that point as the same business name will show up again in another state etc." – BLitE.exe Jun 15 '20 at 17:52
  • So why not use a uuid? – juanpa.arrivillaga Jun 15 '20 at 20:59
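
As a quick sanity check on the capacity discussed in the comments, here is a minimal sketch that counts how many distinct fixed-width codes an alphabet can produce. The helper names `capacity` and `min_width` are made up for this illustration and do not come from the question.

```python
def capacity(alphabet_size, width):
    """Number of distinct codes of exactly `width` symbols."""
    return alphabet_size ** width


def min_width(alphabet_size, n_records):
    """Smallest code width that can label `n_records` distinct records."""
    width = 1
    while alphabet_size ** width < n_records:
        width += 1
    return width


print(capacity(36, 5))              # 60466176 codes with the full base-36 alphabet
print(capacity(35, 5))              # 52521875 codes with the question's 35 symbols
print(min_width(36, 30_000_000))    # 5
print(min_width(36, 100_000_000))   # 6
```

So five characters cover the expected 30 million records, but 100 million records would need a sixth character with either alphabet.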

2 Answers

You can use numpy's base_repr:

import numpy as np   

f'{np.base_repr(index_id-1, 36).lower():0>5}'

You skipped the letter 'l' in your implementation. If you add it to base36 and set BASE = 36, your function will return the same result as this expression.
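
If you would rather avoid the numpy dependency, or want to keep the original 35-symbol alphabet, the same conversion can be done in plain Python with repeated `divmod`. This is only a sketch: `encode_index`, `ALPHABET_35` and `ALPHABET_36` are names invented here, and the function assumes the 1-based indexing of the question (index 1 maps to '00000').

```python
ALPHABET_35 = "0123456789abcdefghijkmnopqrstuvwxyz"   # the question's symbols (no 'l')
ALPHABET_36 = "0123456789abcdefghijklmnopqrstuvwxyz"  # full base 36


def encode_index(index_id, alphabet=ALPHABET_36, width=5):
    """Return the fixed-width code for a 1-based index_id."""
    base = len(alphabet)
    n = index_id - 1  # index 1 corresponds to '00000', as in the question's loop
    if not 0 <= n < base ** width:
        raise ValueError(f"index_id {index_id} does not fit in {width} symbols")
    digits = []
    for _ in range(width):
        n, remainder = divmod(n, base)  # peel off the least significant digit
        digits.append(alphabet[remainder])
    return "".join(reversed(digits))


print(encode_index(1))                # 00000
print(encode_index(37))               # 00010
print(encode_index(22, ALPHABET_35))  # 0000m
```

Each call does a fixed five `divmod` steps regardless of the index, instead of counting up from zero every time.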

Timings:

%timeit generate_unique(1_000_000)
#169 ms ± 705 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit f'{np.base_repr(1_000_000-1, 36).lower():0>5}'
#2.67 µs ± 51.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

So for this index, base_repr is over 63,000 times faster than the loop solution.

Stef

You can try something like this:

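# The question's 35 symbols (no 'l'), defined at module level so the loops can use it
base36 = "0123456789abcdefghijkmnopqrstuvwxyz"

codes = []
# m is the leftmost character and i the rightmost, so codes come out in counting order
for m in range(35):
    for l in range(35):
        for k in range(35):
            for j in range(35):
                for i in range(35):
                    codes.append(base36[m]+base36[l]+base36[k]+base36[j]+base36[i])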

It took less than 2 minutes on my computer.
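
The five nested loops can also be written with `itertools.product`, which yields the combinations in the same order (rightmost character changing fastest). The sketch below wraps it in a generator so the codes are produced lazily rather than stored in one large list. `iter_codes` and `ALPHABET_35` are hypothetical names, and the alphabet is assumed to be the question's 35 symbols.

```python
from itertools import islice, product

ALPHABET_35 = "0123456789abcdefghijkmnopqrstuvwxyz"  # the question's symbols (no 'l')


def iter_codes(alphabet=ALPHABET_35, width=5):
    """Yield every code of `width` symbols in counting order, one at a time."""
    for symbols in product(alphabet, repeat=width):
        yield "".join(symbols)


# Take only as many codes as there are records instead of building all of them.
first_three = list(islice(iter_codes(), 3))
print(first_three)  # ['00000', '00001', '00002']
```

To label records, one option is `zip(records, iter_codes())`, which stops as soon as the records run out.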