I need help populating an empty dataframe in pyspark with auto-generated dates in a column in the format yyyy-mm-dd from 1900-01-01 to 2030-12-31.
- Please see [this SO question](https://stackoverflow.com/q/73510251/8279585), which has what you require – samkart Sep 22 '22 at 09:47
- I think you should rather use `range` and not `sequence`. I think with very wide sequences you can run into memory problems (it creates one row which then needs to be _exploded_ into many rows; see the sketch below the comments). – botchniaque Sep 22 '22 at 10:25
- Does this answer your question? [SparkSQL on pyspark: how to generate time series?](https://stackoverflow.com/questions/43141671/sparksql-on-pyspark-how-to-generate-time-series) – ZygD Sep 22 '22 at 11:08
- @botchniaque - after a small test (`df = spark.range(500000000); print(df.rdd.getNumPartitions())`), it seems that `range` creates 2 partitions no matter how big or small a range is used. So it's not really much of a performance booster compared to `sequence`, which uses just one partition. Nevertheless, I like the approach. – ZygD Sep 22 '22 at 11:20
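For reference, the `sequence` approach discussed in the comments looks roughly like this (a minimal sketch: `sequence` builds one row holding the whole date array, `explode` turns it into one row per date, and the optional `repartition` spreads out the single partition that results):

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, expr, lit, sequence, to_date

spark = SparkSession.builder.getOrCreate()

# one row containing an array of every date in the range,
# exploded into one row per date
dates = spark.range(1).select(
    explode(
        sequence(
            to_date(lit("1900-01-01")),
            to_date(lit("2030-12-31")),
            expr("interval 1 day"),
        )
    ).alias("date")
)
# optional: the exploded result stays in one partition, so repartition
# if downstream work should be parallelized
dates = dates.repartition(8)
dates.show()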
2 Answers
You can use `range()` to generate rows and then cast them to dates, as in the examples below.

PySpark:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date, lit

spark = SparkSession.builder.getOrCreate()

(
    spark
    # generate enough row numbers to cover the whole date range
    .range(0, 100000)
    # row N of the range becomes 1900-01-01 + N days
    .select(
        (to_date(lit("1900-01-01")) + col("id").cast("int")).alias("date")
    )
    # drop the overshoot past the requested end date
    .where(col("date") <= "2030-12-31")
).show()
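To avoid over-generating rows and then filtering, you can also compute the exact row count up front; a minimal sketch using Python's standard library (47847 is simply the inclusive day count between the two dates):

from datetime import date

from pyspark.sql.functions import col, lit, to_date

# inclusive number of dates from 1900-01-01 through 2030-12-31
n_days = (date(2030, 12, 31) - date(1900, 1, 1)).days + 1  # 47847

spark.range(0, n_days).select(
    (to_date(lit("1900-01-01")) + col("id").cast("int")).alias("date")
).show()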
The same in SQL:
WITH all_dates AS (
    SELECT
        TO_DATE('1900-01-01') + CAST(r.id AS INT) AS date
    FROM RANGE(0, 100000) AS r
)
SELECT *
FROM all_dates
WHERE date <= '2030-12-31'

botchniaque
You can use the following sample code:
from pyspark.sql.functions import col, date_add, lit

# 47847 ids cover 1900-01-01 through 2030-12-31, inclusive
spark.range(0, 47847).withColumn(
    "my_date", date_add(lit("1900-01-01"), col("id").cast("int"))
).show()
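The question asks for the yyyy-mm-dd format; a DateType column already displays as yyyy-MM-dd, but if an explicit string column is needed, `date_format` produces one. A small sketch (the `dates_df` name is just for illustration):

from pyspark.sql.functions import col, date_add, date_format, lit

# same dataframe as above, bound to a (hypothetical) name
dates_df = spark.range(0, 47847).withColumn(
    "my_date", date_add(lit("1900-01-01"), col("id").cast("int"))
)
# render the DateType column as an explicit yyyy-MM-dd string
dates_df.withColumn("my_date_str", date_format("my_date", "yyyy-MM-dd")).show()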

Ranga Reddy