
I want to generate a column with random numbers like this:

df=df.withColumn("random_col",random.randint(100000, 1000000))

The above gives me an error:

AssertionError: col should be Column

mblume

2 Answers


First, I would make sure you have imported the correct function.

Try importing: from pyspark.sql.functions import rand

rand() returns a double between 0.0 and 1.0, so scale its output into your range with something like this line of code:

df1 = df.withColumn("random_col", (rand() * (1000000 - 100000) + 100000).cast("int"))
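As a plain-Python sanity check (not Spark code), the mapping from a uniform value r in [0.0, 1.0) to an integer in [lo, hi) is just lo + r * (hi - lo); the helper name scale_to_range below is made up for illustration:

```python
import random

def scale_to_range(r, lo, hi):
    # Map r in [0.0, 1.0) onto the integer range [lo, hi),
    # the same arithmetic applied to Spark's rand() above.
    return int(r * (hi - lo) + lo)

# Spot-check the boundaries and a few random draws
assert scale_to_range(0.0, 100000, 1000000) == 100000
for _ in range(1000):
    v = scale_to_range(random.random(), 100000, 1000000)
    assert 100000 <= v < 1000000
```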

You could also check out this resource; it looks like it may be helpful for what you are doing.

Hope this helps!

love2phish

I ran into this issue and couldn't find anything concrete; eventually I figured it out, so hopefully this helps anyone stuck:

# To add a column of random values from a range, first create the column in a new Spark dataframe.

# import libraries
import random
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType, StructField, StructType

# Define new df schema (the id column is added afterwards,
# so it is not declared here)
schema = StructType(
    [
        StructField("random_value", IntegerType(), nullable=False)
    ]
)

# create empty list
data = list()

for i in range(0, 200):  # adjust values as you wish
    data.append(
        {
            "random_value": random.randint(500, 10000)  # adjust values as you wish
        }
    )

# Create the Spark dataframe
df = spark.createDataFrame(data, schema)

# Add id ordering
df1 = df.withColumn("id", F.monotonically_increasing_id())
  • Then add a matching id column to your other dataframe, join on the respective id columns, and attach the "random_value" column. See this great example for more information on creating id columns on pre-existing dataframes and then joining.
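For illustration only, the id-then-join idea can be sketched without Spark in plain Python (hypothetical row data; in Spark the ids come from F.monotonically_increasing_id() and the join from DataFrame.join):

```python
import random

# Hypothetical stand-ins for two dataframes: one with existing data,
# one with freshly generated random values.
existing_rows = [{"name": n} for n in ["a", "b", "c"]]
random_rows = [{"random_value": random.randint(500, 10000)} for _ in existing_rows]

# Add an id to each side (what monotonically_increasing_id() provides in Spark)...
for i, row in enumerate(existing_rows):
    row["id"] = i
for i, row in enumerate(random_rows):
    row["id"] = i

# ...then join on id to attach the random_value column.
by_id = {row["id"]: row for row in random_rows}
joined = [{**row, "random_value": by_id[row["id"]]["random_value"]} for row in existing_rows]
```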
helrich