I need to generate a collection of items by randomly selecting from a given list. The random choices must be unique, so in the Python implementation I accumulate them into a set inside a while loop:
import string
import random

def id_generator(size=6, chars=string.ascii_uppercase + string.digits):
    return ''.join(random.choice(chars) for _ in range(size))

n = 1_000_000  # the number of unique items desired
my_set = set()
while len(my_set) < n:
    my_set.add(id_generator())
(Credit to https://stackoverflow.com/a/2257449/8840174 for the id_generator syntax.)
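(As an aside, if the plain-Python version is worth tightening first: random.choices, available in Python 3.6+, does all the draws in one call. A minimal sketch of that variant:)

import string
import random

def id_generator(size=6, chars=string.ascii_uppercase + string.digits):
    # k independent draws with replacement, in a single call
    return ''.join(random.choices(chars, k=size))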
What I'd like to do is take advantage of Spark's distributed compute to complete the above much more quickly.
Process-wise, I'm thinking something like this needs to happen: hold the set on the driver node, distribute id_generator() out to the available workers, and keep going until there are n unique items in my set. There doesn't seem to be a PySpark equivalent of random.choices, so maybe I need to use the UDF decorator to register the function in PySpark?
pyspark.sql.functions.rand exists, but it draws from a uniform distribution on [0, 1), not a random choice from some list of items: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.rand.html
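For reference, a minimal illustration of what rand gives you (assuming an active SparkSession bound to spark):

from pyspark.sql import functions as F

# one uniform float in [0, 1) per row -- useful for sampling, not for picking list items
spark.range(3).select(F.rand(seed=42).alias("u")).show()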
from pyspark.sql.functions import udf

@udf
def id_generator():
    import string
    import random

    # inner helper renamed so it doesn't shadow the UDF itself
    def _make(size=6, chars=string.ascii_uppercase + string.digits):
        return ''.join(random.choice(chars) for _ in range(size))

    return _make()
Something like the above? Although I'm still not clear on how, or whether, sets work on Spark.
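From what I've read so far, the closest thing to a set in Spark is distinct() on a DataFrame, so maybe the plan is: generate slightly more rows than needed, apply the UDF, deduplicate, and trim. A rough sketch of that idea (n and the cushion factor are made-up numbers; the UDF is marked nondeterministic so Spark doesn't cache one result across rows):

import string
import random

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

def make_id(size=6, chars=string.ascii_uppercase + string.digits):
    return ''.join(random.choice(chars) for _ in range(size))

# zero-argument UDF, called once per row on the workers;
# asNondeterministic() stops Spark from collapsing repeated calls
id_udf = F.udf(make_id, StringType()).asNondeterministic()

n = 1_000_000            # hypothetical target count
cushion = int(n * 1.01)  # a little extra to absorb collisions

ids = (spark.range(cushion)          # cushion rows, spread across the workers
            .select(id_udf().alias("id"))
            .distinct()              # the distributed stand-in for a Python set
            .limit(n))

If collisions eat through the cushion, ids.count() could still come back short of n, which is presumably where some kind of top-up loop would come in.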
This answer is sort of the right idea, though I doubt that collecting the value from a single-item Spark dataframe is a good idea for millions of iterations.
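Rather than collecting one value per iteration, maybe the top-up can be done in batches on the cluster, with only a count coming back to the driver each round. A sketch reusing id_udf and n from the previous block:

# start with a full batch; cache so the random IDs aren't regenerated
# (a nondeterministic UDF would produce different values on recompute)
ids = spark.range(n).select(id_udf().alias("id")).distinct().cache()
while ids.count() < n:                  # one Spark job per check; also materializes the cache
    batch = spark.range(n // 100 + 1).select(id_udf().alias("id"))
    ids = ids.union(batch).distinct().cache()
ids = ids.limit(n)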
The code works fine in straight Python, but I'd like to bring the runtime down from several hours if possible. (I need to generate several random columns based on various rules/lists of values to create a dataset from scratch.)
*I know that id_generator() uses a size of 6 drawn from 36 characters, giving 36^6 = 2,176,782,336 combinations (http://mathcentral.uregina.ca/QQ/database/QQ.09.00/churilla1.html), so the chance of duplicates is not huge; but even without the set() requirement, I'm still struggling with the best way to append random choices from a list to another list in PySpark.
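To put a number on "not huge": by the birthday approximation, drawing n IDs from N = 36^6 possibilities produces roughly n(n-1)/(2N) duplicate pairs, e.g.:

N = 36 ** 6                   # 2,176,782,336 possible 6-character IDs
n = 1_000_000                 # hypothetical number of draws
print(n * (n - 1) / (2 * N))  # ~230 expected duplicate pairs per million draws

So a cushion of a fraction of a percent should more than cover the deduplication losses.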
This looks promising: Random numbers generation in PySpark
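Following that thread, a UDF-free variation might build each character from the built-in rand plus SQL substring, keeping everything in JVM expressions (a sketch; I haven't benchmarked it, and it assumes the SQL form of substring accepts a computed position):

import string

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
chars = string.ascii_uppercase + string.digits  # 36 symbols

def random_char():
    # pick a 1-based position in [1, 36]; each rand() gets its own seed
    return F.expr(f"substring('{chars}', cast(floor(rand() * 36) + 1 as int), 1)")

n = 1_000_000  # hypothetical target count
ids = (spark.range(n)
            .select(F.concat(*[random_char() for _ in range(6)]).alias("id"))
            .dropDuplicates(["id"]))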