
I am relatively new to Spark SQL, so please forgive me if this question is basic. Suppose I have a dataframe like the one below -

+-------------------+-------------------+---------------+-------------------------+--------------------+
|subs_end_day       |subs_start_date    |diff_in_seconds|diff_in_seconds_minus_one|rounded_diff_in_secs|
+-------------------+-------------------+---------------+-------------------------+--------------------+
|2023-03-02 00:00:00|2023-02-02 00:00:00|2419200        |2419199                  |2354163.0           |
|2023-03-02 00:00:00|2022-03-02 00:00:00|31536000       |31535999                 |1.1590502E7         |
|2023-01-01 00:00:00|2022-12-25 00:00:00|604800         |604799                   |477819.0            |
|2023-03-31 00:00:00|2023-02-28 00:00:00|2678400        |2678399                  |1712934.0           |
|2023-03-17 00:00:00|2023-02-15 00:00:00|2592000        |2591999                  |616976.0            |
|2023-01-16 00:00:00|2022-12-16 00:00:00|2678400        |2678399                  |1115980.0           |
|2023-01-31 00:00:00|2022-12-31 00:00:00|2678400        |2678399                  |2248503.0           |
|2023-01-16 00:00:00|2022-12-16 00:00:00|2678400        |2678399                  |2487670.0           |
|2023-03-31 00:00:00|2023-02-28 00:00:00|2678400        |2678399                  |396182.0            |
|2023-01-16 00:00:00|2022-12-16 00:00:00|2678400        |2678399                  |117018.0            |
|2023-04-01 00:00:00|2023-03-01 00:00:00|2678400        |2678399                  |260403.0            |
|2023-02-15 00:00:00|2023-01-15 00:00:00|2678400        |2678399                  |147771.0            |
|2023-01-16 00:00:00|2022-12-16 00:00:00|2678400        |2678399                  |445293.0            |
|2023-03-31 00:00:00|2023-02-28 00:00:00|2678400        |2678399                  |2066904.0           |
|2023-03-17 00:00:00|2023-02-17 00:00:00|2419200        |2419199                  |1555401.0           |
|2023-04-01 00:00:00|2023-03-01 00:00:00|2678400        |2678399                  |1665065.0           |
|2023-03-02 00:00:00|2023-02-02 00:00:00|2419200        |2419199                  |1126743.0           |
+-------------------+-------------------+---------------+-------------------------+--------------------+
only showing top 100 rows

which is generated by the code below -

%pyspark
spark.sql(f"""
SELECT 
    subs_end_day
    ,subs_start_date
    ,unix_timestamp(subs_end_day) - unix_timestamp(subs_start_date) AS diff_in_seconds
    ,((unix_timestamp(subs_end_day) - unix_timestamp(subs_start_date)) -1) AS diff_in_seconds_minus_one
    ,ROUND(( ((unix_timestamp(subs_end_day) - unix_timestamp(subs_start_date)) -1) * RAND()), 0) AS rounded_diff_in_secs  
FROM subscription_df
""").show(100, False)

What I'm trying to do is to select a random date between subs_end_day and subs_start_date using Spark SQL. I got the inspiration from this StackOverflow post. I have seen a PySpark example of how to do it using functions, like this one, but I'd like to stick to Spark SQL if possible.

To select a random date between subs_end_day and subs_start_date, I tried the following approaches that are commented out in the code chunk below --

%pyspark
spark.sql(f"""
SELECT 
    subs_end_day
    ,subs_start_date
    ,unix_timestamp(subs_end_day) - unix_timestamp(subs_start_date) AS seconds
    ,((unix_timestamp(subs_end_day) - unix_timestamp(subs_start_date)) -1) AS seconds_minus_one
    ,ROUND(( ((unix_timestamp(subs_end_day) - unix_timestamp(subs_start_date)) -1) * RAND()), 0) AS rounded_secs
    --,subs_start_date + INTERVAL ROUND(( ((unix_timestamp(subs_end_day) - unix_timestamp(subs_start_date)) -1) * RAND()) seconds AS random_date
    --,DATEADD(SECOND, ROUND(( ((unix_timestamp(subs_end_day) - unix_timestamp(subs_start_date)) -1) * RAND()), 0), subs_start_date) AS random_date
    --,subs_start_date + make_interval(0, 0, 0, 0, 0, 0, ROUND(( ((unix_timestamp(subs_end_day) - unix_timestamp(subs_start_date)) -1) * RAND()), 0)) AS random_date    
FROM subscription_df
""").show(100, False)

But none of them worked. I think the closest one is subs_start_date + INTERVAL ROUND(( ((unix_timestamp(subs_end_day) - unix_timestamp(subs_start_date)) -1) * RAND()) seconds AS random_date, but INTERVAL .... seconds doesn't seem to accept an expression like ROUND(....) as its value.
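To illustrate what I mean, a hard-coded interval literal like the one in the sketch below parses fine for me (assuming subs_start_date is a timestamp column), so my understanding is that INTERVAL .... seconds only takes a literal number, not a computed expression -

%pyspark
# Sketch: a literal number of seconds works with INTERVAL,
# but swapping in ROUND(...) for the literal does not parse for me
spark.sql("""
SELECT 
    subs_start_date
    ,subs_start_date + INTERVAL 3600 SECONDS AS one_hour_later
FROM subscription_df
""").show(5, False)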

Any suggestions or answers to this issue, which I've been trying to solve for the last hour and a half, would be greatly appreciated. Thank you!


1 Answer


I finally got it working with the following approach -

spark.sql(f"""
SELECT 
    subs_start_date
    ,subs_end_day
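    -- divide the random number of seconds by 86400 to convert it to whole days,
    -- since date_add works at day granularity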
    ,date_add(subs_start_date, INT(ROUND( (RAND() * ((unix_timestamp(subs_end_day) - unix_timestamp(subs_start_date)) -1) /86400), 0))) AS random_date_in_between
FROM subscription_df
""")

But I'd still like to hear from others to see if there is a tidier approach. In other words, I'd rather NOT accept my own answer. :)
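For what it's worth, one variation I've only lightly tested (and which assumes Spark 3.1+, so that timestamp_seconds is available) keeps the randomness at second granularity instead of whole days -

spark.sql(f"""
SELECT 
    subs_start_date
    ,subs_end_day
    -- add a random number of seconds within the window back onto the start epoch,
    -- then convert the resulting epoch value back to a timestamp
    ,timestamp_seconds(
        unix_timestamp(subs_start_date)
        + CAST(RAND() * ((unix_timestamp(subs_end_day) - unix_timestamp(subs_start_date)) - 1) AS BIGINT)
    ) AS random_ts_in_between
FROM subscription_df
""").show(100, False)

The trade-off is that this returns a timestamp rather than a date, so a CAST(... AS DATE) would be needed if only the day matters.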
