Not sure what your input is, but I prepared one solution; you can try something similar.
First we need to calculate the proportions from the input DataFrame and then use them to generate the new records. I generate the new records by joining input_df with the proportions and then exploding a sequence to get the correct number of rows.
from pyspark.sql.functions import col, rand, sum, lit, explode, expr
input_num_rows = 10000
output_num_rows = 20000
possible_values = ["A", "B", "C"]
# Generate a list of values for the "value" column of the input DataFrame.
# You can skip this step and use your own df instead.
input_data = [(possible_values[i % len(possible_values)],) for i in range(input_num_rows)]
input_df = spark.createDataFrame(input_data, ["value"])
# Calculate proportions from the input DataFrame
proportions = input_df.groupBy("value").agg(sum(lit(1)).alias("count"))
total_count = input_df.count()
proportions = proportions.withColumn("proportion", col("count") / total_count)
# Generate the output DataFrame with the correct proportions.
# Every input row is repeated the same number of times, which assumes that
# output_num_rows is a multiple of the input row count (here 20000 / 10000 = 2).
num_repeats_per_row = output_num_rows // total_count
# Duplicate rows based on proportions using "join" and "explode"
output_df = proportions.join(input_df, on="value") \
    .withColumn("rn", rand()) \
    .orderBy("rn") \
    .drop("rn") \
    .withColumn("repeat_id", explode(expr(f"sequence(1, {num_repeats_per_row})"))) \
    .drop("count", "proportion", "repeat_id")
I did a check on the proportions and the output looks like this:
Proportions for Input DataFrame:
+-----+-----+----------+
|value|count|proportion|
+-----+-----+----------+
| B| 3333| 0.3333|
| C| 3333| 0.3333|
| A| 3334| 0.3334|
+-----+-----+----------+
Proportions for Output DataFrame:
+-----+-----+----------+
|value|count|proportion|
+-----+-----+----------+
| B| 6666| 0.3333|
| C| 6666| 0.3333|
| A| 6668| 0.3334|
+-----+-----+----------+
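For reference, the proportion tables above can be reproduced with a small helper along these lines (a sketch; the check_proportions name is mine):

from pyspark.sql.functions import col, count, lit

def check_proportions(df, label):
    # Print the share of each distinct value in the DataFrame
    total = df.count()
    print(f"Proportions for {label}:")
    df.groupBy("value") \
      .agg(count(lit(1)).alias("count")) \
      .withColumn("proportion", col("count") / total) \
      .show()

check_proportions(input_df, "Input DataFrame")
check_proportions(output_df, "Output DataFrame")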