
I want to generate a column with random numbers like this:

df=df.withColumn("random_col",random.randint(100000, 1000000))

The above gives me an error:

AssertionError: col should be Column

mblume

2 Answers


First, I would make sure you have imported the correct function.

Try importing: from pyspark.sql.functions import rand

rand() returns a double between 0.0 and 1.0, so scale its output into your range with something like this line of code:

df1 = df.withColumn("random_col", (rand() * (1000000 - 100000) + 100000).cast("int"))
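As a plain-Python sanity check (not Spark code), the mapping from a uniform value r in [0.0, 1.0) to an integer in [lo, hi) is just lo + r * (hi - lo); the helper name scale_to_range below is made up for illustration:

```python
import random

def scale_to_range(r, lo, hi):
    # Map r in [0.0, 1.0) onto the integer range [lo, hi),
    # the same arithmetic applied to Spark's rand() above.
    return int(r * (hi - lo) + lo)

# Spot-check the boundaries and a few random draws
assert scale_to_range(0.0, 100000, 1000000) == 100000
for _ in range(1000):
    v = scale_to_range(random.random(), 100000, 1000000)
    assert 100000 <= v < 1000000
```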

You could also check out this resource; it looks like it may be helpful for what you are doing.

Hope this helps!

love2phish

I ran into this issue and couldn't find anything concrete; eventually I figured it out, so hopefully this helps anyone stuck:

# To add a column of random values from a range, first create the column in a new Spark dataframe.

# import libraries
import random
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType, StructField, StructType

# Define new df schema (the id column is added afterwards,
# so it is not declared here)
schema = StructType(
    [
        StructField("random_value", IntegerType(), nullable=False)
    ]
)

# create empty list
data = list()

for i in range(0, 200):  # adjust values as you wish
    data.append(
        {
            "random_value": random.randint(500, 10000)  # adjust values as you wish
        }
    )

# Create the Spark dataframe
df = spark.createDataFrame(data, schema)

# Add id ordering
df1 = df.withColumn("id", F.monotonically_increasing_id())
  • Then add a matching id column to your other dataframe, join on the respective id columns, and attach the "random_value" column. See this great example for more information on creating id columns on pre-existing dataframes and then joining.
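For illustration only, the id-then-join idea can be sketched without Spark in plain Python (hypothetical row data; in Spark the ids come from F.monotonically_increasing_id() and the join from DataFrame.join):

```python
import random

# Hypothetical stand-ins for two dataframes: one with existing data,
# one with freshly generated random values.
existing_rows = [{"name": n} for n in ["a", "b", "c"]]
random_rows = [{"random_value": random.randint(500, 10000)} for _ in existing_rows]

# Add an id to each side (what monotonically_increasing_id() provides in Spark)...
for i, row in enumerate(existing_rows):
    row["id"] = i
for i, row in enumerate(random_rows):
    row["id"] = i

# ...then join on id to attach the random_value column.
by_id = {row["id"]: row for row in random_rows}
joined = [{**row, "random_value": by_id[row["id"]]["random_value"]} for row in existing_rows]
```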
helrich