
I have a dataframe df1 which has around 4000 rows and a string column 'variable'. Now I want to create a sample dataset for df1 from dataframe df, which has 1M rows. Both datasets have the "variable" column in common.

  1. The sample dataset to be created should be 2X the size of df1.
  2. The sample dataset to be created should contain the same proportion of variable counts as df1.
  3. For example, the count of variable 'BLR' is 750 out of 4000 records, which is around 18% in df1. The sample dataset should then contain 1500 out of 8000 records, which is also around 18%.
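The requirement above is stratified sampling: scale every group's count in df1 by the same target-to-source ratio. A minimal pure-Python sketch of that arithmetic (the per-value counts below are hypothetical; only the 'BLR' figure comes from the example):

```python
from collections import Counter

# Hypothetical per-value counts for df1's 'variable' column (4000 rows total);
# only the 'BLR' count of 750 is taken from the example above.
df1_counts = Counter({"BLR": 750, "HYD": 2000, "DEL": 1250})

total = sum(df1_counts.values())   # 4000
target_total = 2 * total           # 8000, i.e. 2X the size of df1

# Scale each group's count by the same factor to preserve the proportions
target_counts = {v: c * target_total // total for v, c in df1_counts.items()}
print(target_counts)  # {'BLR': 1500, 'HYD': 4000, 'DEL': 2500}
```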
Harish reddy
  • A related question: [Oversampling or SMOTE in Pyspark](https://stackoverflow.com/q/53936850/2129801) – werner Jul 29 '23 at 17:58

1 Answer


Not sure what your input is, but I prepared one solution; you can try something similar.

First we need to calculate the proportions from the input df and then use them to generate new records. I generate the new records by joining inputDf with the proportions and then exploding the values to get the correct number of rows.

from pyspark.sql.functions import col, rand, sum, lit, explode, expr

input_num_rows = 10000

output_num_rows = 20000

possible_values = ["A", "B", "C"]

# Generate values for the "value" column of the input DataFrame.
# You can hardcode some values here, or use your own df and skip this step.
input_data = [(possible_values[i % len(possible_values)],) for i in range(input_num_rows)]
input_df = spark.createDataFrame(input_data, ["value"])

# Calculate proportions from the input DataFrame
proportions = input_df.groupBy("value").agg(sum(lit(1)).alias("count"))
total_count = input_df.count()
proportions = proportions.withColumn("proportion", col("count") / total_count)

# Each input row must be duplicated this many times; this assumes
# output_num_rows is an integer multiple of the input row count
num_values_per_output_row = output_num_rows // total_count

# Duplicate every input row num_values_per_output_row times using "join" and "explode";
# repeating each row the same number of times preserves the input proportions.
# The "rn" column is only used to shuffle the rows.
output_df = proportions.join(input_df, on="value") \
                       .withColumn("rn", rand()) \
                       .orderBy("rn") \
                       .drop("rn") \
                       .withColumn("repeated_values", explode(expr(f"sequence(0, {num_values_per_output_row - 1})"))) \
                       .drop("count", "proportion", "repeated_values")
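The join-and-explode step simply repeats every input row the same fixed number of times, which is exactly why the proportions survive. Stripped of the Spark API, the same idea in plain Python (with toy data matching the shapes above):

```python
from collections import Counter

# Toy stand-in for the Spark join-and-explode: 10000 input rows
# with the same A/B/C split as in the example run below.
input_rows = ["A"] * 3334 + ["B"] * 3333 + ["C"] * 3333
factor = 20000 // len(input_rows)  # num_values_per_output_row, here 2

# Repeat each row `factor` times; every group scales by the same factor,
# so the per-value proportions are unchanged.
output_rows = [v for v in input_rows for _ in range(factor)]

print(Counter(output_rows))  # Counter({'A': 6668, 'B': 6666, 'C': 6666})
```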

I did a check and the output looks like this:

Proportions for Input DataFrame:
+-----+-----+----------+
|value|count|proportion|
+-----+-----+----------+
|    B| 3333|    0.3333|
|    C| 3333|    0.3333|
|    A| 3334|    0.3334|
+-----+-----+----------+

Proportions for Output DataFrame:
+-----+-----+----------+
|value|count|proportion|
+-----+-----+----------+
|    B| 6666|    0.3333|
|    C| 6666|    0.3333|
|    A| 6668|    0.3334|
+-----+-----+----------+
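As a side note, if approximate (rather than exact) counts are acceptable, Spark also ships stratified sampling via `DataFrame.sampleBy(col, fractions, seed)`. A sketch of building its `fractions` argument from df1's group counts; all counts here are hypothetical except the 'BLR' figure from the question, and the Spark call itself is left commented out:

```python
# Hypothetical group counts; only 'BLR': 750 comes from the question.
df1_counts = {"BLR": 750, "HYD": 2000, "DEL": 1250}   # df1, 4000 rows
df_counts = {"BLR": 180000, "HYD": 500000, "DEL": 320000}  # the 1M-row df

df1_total = sum(df1_counts.values())   # 4000
target_total = 2 * df1_total           # 8000

# Fraction of df's rows to keep per value so the expected sample size
# per group matches df1's proportions at 2X scale.
fractions = {v: (c / df1_total) * target_total / df_counts[v]
             for v, c in df1_counts.items()}

# sampled = df.sampleBy("variable", fractions, seed=42)  # approximate counts
print(fractions["BLR"])  # 1500 / 180000
```

Unlike the join-and-explode approach, `sampleBy` draws real rows from the big df instead of duplicating df1's rows, but the per-group counts are only correct in expectation.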
M_S