I am trying to partition a customer sample set by region and marketplace using Spark SQL, and from each partition I would like to select a random sample of 100,000 users. Is there a way to use random() in the order by of a partition by ... order by window clause in Spark SQL? The code below, which uses random(), always returns an error, but without random() it works perfectly. Thank you in advance for your answers!
customer_panel_s3_location = f"s3://my-bucket/region_id={region_id}/marketplace_id={marketplace_id}/"
customer_panel_table = spark.read.parquet(customer_panel_s3_location)
customer_panel_table.createOrReplaceTempView("customer_panel")
dataset_date = '2023-03-16'
customer_sample_size = 100000
partition_customers_by = 'region_id, marketplace_id'
# The code below returns an error
customer_panel_df = spark.sql(f"""
SELECT *
FROM (
SELECT *
, row_number() over (partition by {partition_customers_by} order by random()) AS rn
FROM
customer_panel AS c
WHERE
c.target_date < CAST('{dataset_date}' AS DATE)
AND c.target_date >= date_sub(CAST('{dataset_date}' AS DATE), 7)
) t
WHERE t.rn <= bigint({customer_sample_size})
""")
# Without random(), the same query works
customer_panel_df = spark.sql(f"""
    SELECT *
         , row_number() over (partition by {partition_customers_by} order by {partition_customers_by}) AS rn
    FROM customer_panel AS c
    WHERE c.target_date < CAST('{dataset_date}' AS DATE)
      AND c.target_date >= date_sub(CAST('{dataset_date}' AS DATE), 7)
""")
print(f"Row count of {table_name}: {customer_panel_df.count():,}")