I am relatively new to Spark SQL, so please forgive me if this question is basic. Suppose I have a DataFrame like the one below:
+-------------------+-------------------+---------------+-------------------------+--------------------+
|subs_end_day |subs_start_date |diff_in_seconds|diff_in_seconds_minus_one|rounded_diff_in_secs|
+-------------------+-------------------+---------------+-------------------------+--------------------+
|2023-03-02 00:00:00|2023-02-02 00:00:00|2419200 |2419199 |2354163.0 |
|2023-03-02 00:00:00|2022-03-02 00:00:00|31536000 |31535999 |1.1590502E7 |
|2023-01-01 00:00:00|2022-12-25 00:00:00|604800 |604799 |477819.0 |
|2023-03-31 00:00:00|2023-02-28 00:00:00|2678400 |2678399 |1712934.0 |
|2023-03-17 00:00:00|2023-02-15 00:00:00|2592000 |2591999 |616976.0 |
|2023-01-16 00:00:00|2022-12-16 00:00:00|2678400 |2678399 |1115980.0 |
|2023-01-31 00:00:00|2022-12-31 00:00:00|2678400 |2678399 |2248503.0 |
|2023-01-16 00:00:00|2022-12-16 00:00:00|2678400 |2678399 |2487670.0 |
|2023-03-31 00:00:00|2023-02-28 00:00:00|2678400 |2678399 |396182.0 |
|2023-01-16 00:00:00|2022-12-16 00:00:00|2678400 |2678399 |117018.0 |
|2023-04-01 00:00:00|2023-03-01 00:00:00|2678400 |2678399 |260403.0 |
|2023-02-15 00:00:00|2023-01-15 00:00:00|2678400 |2678399 |147771.0 |
|2023-01-16 00:00:00|2022-12-16 00:00:00|2678400 |2678399 |445293.0 |
|2023-03-31 00:00:00|2023-02-28 00:00:00|2678400 |2678399 |2066904.0 |
|2023-03-17 00:00:00|2023-02-17 00:00:00|2419200 |2419199 |1555401.0 |
|2023-04-01 00:00:00|2023-03-01 00:00:00|2678400 |2678399 |1665065.0 |
|2023-03-02 00:00:00|2023-02-02 00:00:00|2419200 |2419199 |1126743.0 |
+-------------------+-------------------+---------------+-------------------------+--------------------+
only showing top 100 rows
which is generated by the code below:
%pyspark
spark.sql(f"""
    SELECT
        subs_end_day
        ,subs_start_date
        ,unix_timestamp(subs_end_day) - unix_timestamp(subs_start_date) AS diff_in_seconds
        ,((unix_timestamp(subs_end_day) - unix_timestamp(subs_start_date)) -1) AS diff_in_seconds_minus_one
        ,ROUND(( ((unix_timestamp(subs_end_day) - unix_timestamp(subs_start_date)) -1) * RAND()), 0) AS rounded_diff_in_secs
    FROM subscription_df
""").show(100, False)
What I'm trying to do is select a random date between subs_end_day and subs_start_date using Spark SQL. I got the inspiration from this StackOverflow post. I have seen a PySpark example of how to do it with DataFrame functions like this one, but I'd like to stick to Spark SQL if possible.
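For context, the DataFrame-API version I have in mind would look roughly like the sketch below. It is untested, assumes subscription_df is also available as a DataFrame (not just a temp view), and uses timestamp_seconds, which I believe needs Spark 3.1+:
%pyspark
from pyspark.sql import functions as F

# Untested sketch of the DataFrame-API approach I'd rather avoid:
# pick a random number of seconds between the two dates and add it to
# subs_start_date (assumes subscription_df is a DataFrame and Spark 3.1+).
subscription_df.withColumn(
    "random_date",
    F.timestamp_seconds(
        F.unix_timestamp("subs_start_date")
        + F.floor(
            F.rand()
            * (F.unix_timestamp("subs_end_day") - F.unix_timestamp("subs_start_date"))
        )
    ),
).show(10, False)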
To select a random date between subs_end_day and subs_start_date, I tried the following approaches, which are commented out in the code chunk below:
%pyspark
spark.sql(f"""
    SELECT
        subs_end_day
        ,subs_start_date
        ,unix_timestamp(subs_end_day) - unix_timestamp(subs_start_date) AS seconds
        ,((unix_timestamp(subs_end_day) - unix_timestamp(subs_start_date)) -1) AS seconds_minus_one
        ,ROUND(( ((unix_timestamp(subs_end_day) - unix_timestamp(subs_start_date)) -1) * RAND()), 0) AS rounded_secs
        --,subs_start_date + INTERVAL ROUND(( ((unix_timestamp(subs_end_day) - unix_timestamp(subs_start_date)) -1) * RAND()) seconds AS random_date
        --,DATEADD(SECOND, ROUND(( ((unix_timestamp(subs_end_day) - unix_timestamp(subs_start_date)) -1) * RAND()), 0), subs_start_date) AS random_date
        --,subs_start_date + make_interval(0, 0, 0, 0, 0, 0, ROUND(( ((unix_timestamp(subs_end_day) - unix_timestamp(subs_start_date)) -1) * RAND()), 0)) AS random_date
    FROM subscription_df
""").show(100, False)
But they did not work. I think the closest one is subs_start_date + INTERVAL ROUND(( ((unix_timestamp(subs_end_day) - unix_timestamp(subs_start_date)) -1) * RAND()) seconds AS random_date, but I don't think INTERVAL .... seconds is happy with an expression like ROUND(....) as the value, since the interval literal seems to expect a constant.
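For what it's worth, the direction I suspect might work, but haven't been able to verify, is to stay in epoch seconds and only convert back to a timestamp at the end, instead of fighting the INTERVAL syntax. A rough sketch (timestamp_seconds assumes Spark 3.1+; on older versions I guess from_unixtime plus a cast would be needed):
%pyspark
spark.sql("""
    SELECT
        subs_end_day
        ,subs_start_date
        -- untested sketch: add a random number of seconds (less than the gap)
        -- to the start date in epoch-second space, then convert back
        -- (timestamp_seconds assumes Spark 3.1+)
        ,timestamp_seconds(
            unix_timestamp(subs_start_date)
            + FLOOR(RAND() * (unix_timestamp(subs_end_day) - unix_timestamp(subs_start_date)))
         ) AS random_date
    FROM subscription_df
""").show(100, False)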
Any suggestions/answers to fix this issue that I've been trying to solve for the last hour and a half would be greatly appreciated. Thank you!