
I'm trying to use the rand function in PySpark to generate a column with random numbers. I would like the rand function to take in the primary key of the row as the seed so that the number is reproducible. However, when I run:

df.withColumn('rand_key', F.rand(F.col('primary_id')))

I get the error

TypeError: 'Column' object is not callable

How can I use the value in the row as my rand seed?

nao
  • Wasn't able to get it working using expr. Instead I got "AnalysisException: u'Input argument to rand must be an integer, long or null literal.;'" – nao Mar 26 '19 at 21:33
  • How are you using `expr`? What is the datatype of `primary_id`? Try `df.withColumn('rand_key', F.expr("rand(primary_id)"))` – pault Mar 26 '19 at 21:41

1 Answer


The problem with the F.rand(seed) function is that it takes a long seed parameter and treats it as a literal (a static value applied to the whole column), so you cannot pass a per-row column as the seed.
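To illustrate (a minimal sketch against the question's dataframe, not run here): a literal seed is accepted but seeds the whole column rather than varying per row, while passing a Column raises the TypeError from the question.

from pyspark.sql import functions as F

# Accepted: a single integer/long literal seeds rand for the entire column
df.withColumn('rand_key', F.rand(42))

# Not accepted: a Column as the seed
# df.withColumn('rand_key', F.rand(F.col('primary_id')))  # TypeError: 'Column' object is not callable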

One way to work around this is to create your own rand function as a UDF that takes a column as its parameter:

import random

from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

def rand(seed):
    # Seed Python's RNG with the row value so the result is reproducible per key
    random.seed(seed)
    return random.random()

# Wrap the function as a UDF that returns a double
rand_udf = udf(rand, DoubleType())

df = spark.createDataFrame([(1, 'a'), (2, 'b'), (1, 'c')], ['a', 'b'])
df.withColumn('rr', rand_udf(df.a)).show()
+---+---+-------------------+
|  a|  b|                 rr|
+---+---+-------------------+
|  1|  a|0.13436424411240122|
|  2|  b| 0.9560342718892494|
|  1|  c|0.13436424411240122|
+---+---+-------------------+
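Applied to the question's dataframe, the same UDF would be called with the primary key column (a sketch assuming primary_id is the question's key column):

from pyspark.sql import functions as F

df.withColumn('rand_key', rand_udf(F.col('primary_id')))

Because each call reseeds Python's random module with the row value, rows that share the same key always get the same number, as the duplicate key 1 in the example output above shows.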
botchniaque