
I'm writing some code to generate a unique ID for each event that comes in within a given version. The value can repeat in a future version, as the version prefix will change. I have the version information, but I'm struggling to generate the uid. I found some code that seems to produce what I need, found here, and have adapted it for my purposes, but I am facing an issue.

I have the information I need in a dataframe, but when I run the code it returns the same value for every row. I suspect the issue stems from how I am using the `used` set from the example: it isn't being persisted properly, which is why the same value comes back each time.

Can anyone provide a hint on where to look? I can't work out how to persist the information so that it changes for each row. Side note: I can't use pandas, so its udf function is out, and the uuid module is no good either, because the requirement is to keep the ID short enough for easy human typing when searching. I've posted the code below.

import itertools
import string
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

@udf(returnType=StringType())
def uid_generator(id_column):
  valid_chars = set(string.ascii_lowercase + string.digits) - set('lio01')
  
  used = set()
  
  unique_id_generator = itertools.combinations(valid_chars, 6)

  uid = "".join(next(unique_id_generator)).upper()
  while uid in used:
    uid = "".join(next(unique_id_generator))
    
  return uid

  used.add(uid)
  
  
#uuid_udf = udf(uuid_generator,)
  
df2 = df_uid_register_input.withColumn('uid', uid_generator(df_uid_register_input.record))

The output is:

[Screenshot of the output: every row in the `uid` column contains the same value.]

  • You are using a seed, so the sequence will always be the same. At the end of your function you `add` the ID, but it comes after the `return`, so it is never reached. Moreover, the `used` object is not shared between the workers, so you will always get the same sequence, no matter what. – Steven Jul 12 '22 at 12:20
  • `uid_generator` itself should be a generator, rather than creating a *new* set of combinations and having to walk through it to find the next unused value each time you call it. – chepner Jul 12 '22 at 14:23
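A minimal sketch of chepner's suggestion, in plain Python rather than a Spark UDF (the names here are illustrative): making the allocator itself a generator means it remembers its position between calls, instead of rebuilding the combinations and scanning a `used` set every time. Note that this state lives in a single process; inside a Spark UDF each worker would get its own copy, which is Steven's point.

import itertools
import string

def make_uid_generator():
    # Sort so iteration order is deterministic (set order is arbitrary).
    valid_chars = sorted(set(string.ascii_lowercase + string.digits) - set('lio01'))
    # Each 6-character combination is yielded exactly once,
    # so no 'used' set is needed at all.
    for combo in itertools.combinations(valid_chars, 6):
        yield ''.join(combo)

uids = make_uid_generator()
print(next(uids))  # 234567
print(next(uids))  # 234568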

1 Answer


In the function definition you have the argument `id_column`, but you never use it in the function body. It also seems that you haven't tried to use the version column.

What may be easier is not to aim for true uniqueness, but to use one of the hash functions. Although in theory they don't guarantee unique results, in practice it is ridiculously unlikely that two different inputs would produce the same hash.

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [(1, 1, 2),
     (2, 1, 2),
     (3, 1, 2)],
    ['record', 'job_id', 'version'])

df = df.select(
    '*',
    F.sha1(F.concat_ws('_', 'record', 'version')).alias('uid1'),
    F.sha2(F.concat_ws('_', 'record', 'version'), 0).alias('uid2'),  # numBits=0 selects SHA-256
    F.md5(F.concat_ws('_', 'record', 'version')).alias('uid3'),
)
df.show()
# +------+------+-------+--------------------+--------------------+--------------------+
# |record|job_id|version|                uid1|                uid2|                uid3|
# +------+------+-------+--------------------+--------------------+--------------------+
# |     1|     1|      2|486cbd63f94d703d2...|0c79023f435b2e9e6...|ab35e84a215f0f711...|
# |     2|     1|      2|f5d7b663eea5f2e69...|48fccc7ee00b72959...|5229803558d4b7895...|
# |     3|     1|      2|982bde375462792cb...|ad9a5c5fb1bc135d8...|dfe3a334fc99f298a...|
# +------+------+-------+--------------------+--------------------+--------------------+
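If the ID also needs to be short enough for a human to type, as the question mentions, one option (an addition on top of the above, not part of the original answer) is to keep only a prefix of the hash. A shorter prefix raises the collision risk, so the 8-character length below is just an assumption to illustrate the idea.

df = df.withColumn(
    'uid_short',
    # Keep the first 8 hex characters of the sha1 hash (assumed length).
    F.substring(F.sha1(F.concat_ws('_', 'record', 'version')), 1, 8),
)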