
I need help populating an empty dataframe in pyspark with auto-generated dates in a column in the format yyyy-mm-dd from 1900-01-01 to 2030-12-31.

  • please see [this SO question](https://stackoverflow.com/q/73510251/8279585) which has what you require – samkart Sep 22 '22 at 09:47
  • I think you should rather use `range` and not `sequence`. I think with a very wide sequences you can run into memory problems (it creates one row which then needs to be _exploded_ into many rows). – botchniaque Sep 22 '22 at 10:25
  • Does this answer your question? [SparkSQL on pyspark: how to generate time series?](https://stackoverflow.com/questions/43141671/sparksql-on-pyspark-how-to-generate-time-series) – ZygD Sep 22 '22 at 11:08
  • @botchniaque - after a small test (`df = spark.range(500000000); print(df.rdd.getNumPartitions())`), it seems that `range` creates 2 partitions no matter how big or small a range is used. So it's not really much of a performance booster compared to `sequence`, which uses just one partition. Nevertheless, I like the approach. – ZygD Sep 22 '22 at 11:20
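
For reference, the `sequence`-based approach from the linked questions looks roughly like this (a sketch assuming Spark 2.4+, where `sequence` accepts date bounds and an interval step; as the comments note, it first builds a single array row that is then exploded):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# one row holding an array of all dates, exploded into one row per date
spark.sql("""
    SELECT explode(
        sequence(to_date('1900-01-01'), to_date('2030-12-31'), interval 1 day)
    ) AS date
""").show()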

2 Answers


You can use `range()` to generate the row ids and then turn each id into a date by adding it as a day offset to the start date, like in the examples below:

pyspark:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date, lit

spark = SparkSession.builder.getOrCreate()

(
    spark
    .range(0, 100000)  # produces a single "id" column: 0, 1, 2, ...
    .select(
        # add each id as a day offset to the start date
        (to_date(lit("1900-01-01")) + col("id").cast("int")).alias("date")
    )
    # 100000 days overshoots 2030-12-31, so trim to the requested end date
    .where(col("date") <= "2030-12-31")
).show()

The same in SQL

WITH all_dates AS (
    SELECT
        TO_DATE('1900-01-01') + CAST(r.id AS INT) AS date
    FROM RANGE(0, 100000) AS r
)
SELECT *
FROM all_dates
WHERE date <= '2030-12-31'
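
If you are running the SQL variant from pyspark rather than a SQL shell, it can simply be passed to `spark.sql` (a sketch using the query above):

query = """
WITH all_dates AS (
    SELECT TO_DATE('1900-01-01') + CAST(r.id AS INT) AS date
    FROM RANGE(0, 100000) AS r
)
SELECT * FROM all_dates WHERE date <= '2030-12-31'
"""
spark.sql(query).show()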
– botchniaque

Hi, you can use the following sample code, which adds each generated id as a day offset to the start date:

from pyspark.sql.functions import col, date_add, lit

# 47,846 days separate 1900-01-01 from 2030-12-31, so generate ids 0..47846
spark.range(0, 47847).withColumn(
    "my_date", date_add(lit("1900-01-01"), col("id").cast("int"))
).show()
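
If you prefer not to hardcode the number of days, it can be derived with `datediff` first (a sketch assuming the same 1900-01-01 to 2030-12-31 bounds as the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# number of days between the two bounds (47846 for these dates)
n_days = spark.sql(
    "SELECT datediff(to_date('2030-12-31'), to_date('1900-01-01')) AS d"
).first()["d"]

# +1 so the end date itself is included
spark.range(0, n_days + 1).selectExpr(
    "date_add(to_date('1900-01-01'), cast(id AS int)) AS my_date"
).show()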
– Ranga Reddy