
I need help populating an empty dataframe in pyspark with auto-generated dates in a column in the format yyyy-mm-dd from 1900-01-01 to 2030-12-31.

  • please see [this SO question](https://stackoverflow.com/q/73510251/8279585) which has what you require – samkart Sep 22 '22 at 09:47
  • I think you should rather use `range` and not `sequence`. I think with a very wide sequences you can run into memory problems (it creates one row which then needs to be _exploded_ into many rows). – botchniaque Sep 22 '22 at 10:25
  • Does this answer your question? [SparkSQL on pyspark: how to generate time series?](https://stackoverflow.com/questions/43141671/sparksql-on-pyspark-how-to-generate-time-series) – ZygD Sep 22 '22 at 11:08
  • @botchniaque - after a small test (`df = spark.range(500000000); print(df.rdd.getNumPartitions())`), it seems that `range` creates 2 partitions no matter how big or small a range is used. So it's not really much of a performance booster compared to `sequence`, which uses just one partition. Nevertheless, I like the approach. – ZygD Sep 22 '22 at 11:20
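
For reference, the `sequence`-based approach from the linked questions looks roughly like this (a sketch assuming Spark 2.4+, where `sequence` accepts date bounds and an interval step; as the comments note, it first builds a single array row that is then exploded):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# one row holding an array of all dates, exploded into one row per date
spark.sql("""
    SELECT explode(
        sequence(to_date('1900-01-01'), to_date('2030-12-31'), interval 1 day)
    ) AS date
""").show()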

2 Answers


You can use `range()` to generate the row ids and then turn each id into a date by adding it as a day offset to the start date, like in the examples below:

pyspark:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date, lit

spark = SparkSession.builder.getOrCreate()

(
    spark
    .range(0, 100000)  # produces a single "id" column: 0, 1, 2, ...
    .select(
        # add each id as a day offset to the start date
        (to_date(lit("1900-01-01")) + col("id").cast("int")).alias("date")
    )
    # 100000 days overshoots 2030-12-31, so trim to the requested end date
    .where(col("date") <= "2030-12-31")
).show()

The same in SQL

WITH all_dates AS (
    SELECT
        TO_DATE('1900-01-01') + CAST(r.id AS INT) AS date
    FROM RANGE(0, 100000) AS r
)
SELECT *
FROM all_dates
WHERE date <= '2030-12-31'
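
If you are running the SQL variant from pyspark rather than a SQL shell, it can simply be passed to `spark.sql` (a sketch using the query above):

query = """
WITH all_dates AS (
    SELECT TO_DATE('1900-01-01') + CAST(r.id AS INT) AS date
    FROM RANGE(0, 100000) AS r
)
SELECT * FROM all_dates WHERE date <= '2030-12-31'
"""
spark.sql(query).show()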
– botchniaque

Hi, you can use the following sample code, which adds each generated id as a day offset to the start date:

from pyspark.sql.functions import col, date_add, lit

# 47,846 days separate 1900-01-01 from 2030-12-31, so generate ids 0..47846
spark.range(0, 47847).withColumn(
    "my_date", date_add(lit("1900-01-01"), col("id").cast("int"))
).show()
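
If you prefer not to hardcode the number of days, it can be derived with `datediff` first (a sketch assuming the same 1900-01-01 to 2030-12-31 bounds as the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# number of days between the two bounds (47846 for these dates)
n_days = spark.sql(
    "SELECT datediff(to_date('2030-12-31'), to_date('1900-01-01')) AS d"
).first()["d"]

# +1 so the end date itself is included
spark.range(0, n_days + 1).selectExpr(
    "date_add(to_date('1900-01-01'), cast(id AS int)) AS my_date"
).show()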
– Ranga Reddy