I need to generate a collection of items by randomly selecting from a given list. The random choices must be unique, so in the Python implementation I accumulate them into a set inside a while loop:
import string
import random

def id_generator(size=6, chars=string.ascii_uppercase + string.digits):
    return ''.join(random.choice(chars) for _ in range(size))

n = 1_000_000  # the number of unique items desired
my_set = set()
while len(my_set) < n:
    my_set.add(id_generator())
(Credit to https://stackoverflow.com/a/2257449/8840174 for the id_generator syntax.)
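(As an aside, if the plain-Python version is worth tightening first: random.choices, available in Python 3.6+, does all the draws in one call. A minimal sketch of that variant:)

import string
import random

def id_generator(size=6, chars=string.ascii_uppercase + string.digits):
    # k independent draws with replacement, in a single call
    return ''.join(random.choices(chars, k=size))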
What I'd like to do is take advantage of Spark's distributed compute to complete the above much more quickly.
Process-wise, I'm thinking something like this needs to happen: hold the set on the driver node, distribute id_generator() out to the available workers, and keep going until there are n unique items in my set. There doesn't seem to be a PySpark equivalent of random.choices, so maybe I need to use the UDF decorator to register the function in PySpark?
pyspark.sql.functions.rand exists, but it draws from a uniform distribution on [0, 1), not a random choice from some list of items: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.rand.html
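For reference, a minimal illustration of what rand gives you (assuming an active SparkSession bound to spark):

from pyspark.sql import functions as F

# one uniform float in [0, 1) per row -- useful for sampling, not for picking list items
spark.range(3).select(F.rand(seed=42).alias("u")).show()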
from pyspark.sql.functions import udf

@udf
def id_generator():
    import string
    import random

    # inner helper renamed so it doesn't shadow the UDF itself
    def _make(size=6, chars=string.ascii_uppercase + string.digits):
        return ''.join(random.choice(chars) for _ in range(size))

    return _make()
Something like the above? Although I'm still not clear on how, or whether, sets work on Spark.
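From what I've read so far, the closest thing to a set in Spark is distinct() on a DataFrame, so maybe the plan is: generate slightly more rows than needed, apply the UDF, deduplicate, and trim. A rough sketch of that idea (n and the cushion factor are made-up numbers; the UDF is marked nondeterministic so Spark doesn't cache one result across rows):

import string
import random

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

def make_id(size=6, chars=string.ascii_uppercase + string.digits):
    return ''.join(random.choice(chars) for _ in range(size))

# zero-argument UDF, called once per row on the workers;
# asNondeterministic() stops Spark from collapsing repeated calls
id_udf = F.udf(make_id, StringType()).asNondeterministic()

n = 1_000_000            # hypothetical target count
cushion = int(n * 1.01)  # a little extra to absorb collisions

ids = (spark.range(cushion)          # cushion rows, spread across the workers
            .select(id_udf().alias("id"))
            .distinct()              # the distributed stand-in for a Python set
            .limit(n))

If collisions eat through the cushion, ids.count() could still come back short of n, which is presumably where some kind of top-up loop would come in.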
This answer is sort of the right idea, though I doubt that collecting the value from a single-item Spark dataframe is a good idea for millions of iterations.
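Rather than collecting one value per iteration, maybe the top-up can be done in batches on the cluster, with only a count coming back to the driver each round. A sketch reusing id_udf and n from the previous block:

# start with a full batch; cache so the random IDs aren't regenerated
# (a nondeterministic UDF would produce different values on recompute)
ids = spark.range(n).select(id_udf().alias("id")).distinct().cache()
while ids.count() < n:                  # one Spark job per check; also materializes the cache
    batch = spark.range(n // 100 + 1).select(id_udf().alias("id"))
    ids = ids.union(batch).distinct().cache()
ids = ids.limit(n)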
The code works fine in straight Python, but I'd like to bring the runtime down from several hours if possible. (I need to generate several random columns based on various rules/lists of values to create a dataset from scratch.)
*I know that id_generator() uses a size of 6 drawn from 36 characters, giving 36^6 = 2,176,782,336 combinations (http://mathcentral.uregina.ca/QQ/database/QQ.09.00/churilla1.html), so the chance of duplicates is not huge; but even without the set() requirement, I'm still struggling with the best way to append random choices from a list to another list in PySpark.
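To put a number on "not huge": by the birthday approximation, drawing n IDs from N = 36^6 possibilities produces roughly n(n-1)/(2N) duplicate pairs, e.g.:

N = 36 ** 6                   # 2,176,782,336 possible 6-character IDs
n = 1_000_000                 # hypothetical number of draws
print(n * (n - 1) / (2 * N))  # ~230 expected duplicate pairs per million draws

So a cushion of a fraction of a percent should more than cover the deduplication losses.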
This looks promising: Random numbers generation in PySpark
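Following that thread, a UDF-free variation might build each character from the built-in rand plus SQL substring, keeping everything in JVM expressions (a sketch; I haven't benchmarked it, and it assumes the SQL form of substring accepts a computed position):

import string

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
chars = string.ascii_uppercase + string.digits  # 36 symbols

def random_char():
    # pick a 1-based position in [1, 36]; each rand() gets its own seed
    return F.expr(f"substring('{chars}', cast(floor(rand() * 36) + 1 as int), 1)")

n = 1_000_000  # hypothetical target count
ids = (spark.range(n)
            .select(F.concat(*[random_char() for _ in range(6)]).alias("id"))
            .dropDuplicates(["id"]))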