0

Suppose I have the following DataFrame:

import pandas as pd
import numpy as np
np.random.seed(0)
n = 10
months, days = np.random.randint(1, 12, n), np.random.randint(1, 28, n)
df = spark.createDataFrame(
    pd.DataFrame({'date': [pd.datetime(2018, x, y).date() for x, y in zip(months, days)]})
)
df.show()
#+----------+
#|      date|
#+----------+
#|2018-06-24|
#|2018-01-07|
#|2018-04-25|
#|2018-04-25|
#|2018-08-13|
#|2018-10-27|
#|2018-04-02|
#|2018-06-07|
#|2018-03-08|
#|2018-05-24|
#+----------+

How can I create a new column that has the last day of the previous month, without using a udf?

pault
  • 41,343
  • 15
  • 107
  • 149

2 Answers2

1

One way is to use pyspark.sql.functions.date_sub to subtract the dayOfMonth from the date column. In order for this to work properly, you'll have to use pyspark.sql.functions.expr to pass a column value as a parameter to date_sub.

from pyspark.sql.functions import expr
df.withColumn("previous_month", expr("date_sub(date, dayOfMonth(date))")).show()
#+----------+--------------+
#|      date|previous_month|
#+----------+--------------+
#|2018-06-24|    2018-05-31|
#|2018-01-07|    2017-12-31|
#|2018-04-25|    2018-03-31|
#|2018-04-25|    2018-03-31|
#|2018-08-13|    2018-07-31|
#|2018-10-27|    2018-09-30|
#|2018-04-02|    2018-03-31|
#|2018-06-07|    2018-05-31|
#|2018-03-08|    2018-02-28|
#|2018-05-24|    2018-04-30|
#+----------+--------------+
pault
  • 41,343
  • 15
  • 107
  • 149
1

Spark has a last_day() and add_months() that you can use to get the last day of previous month. Use the add_months() with -1 to get the previous month, then use the last_day() to get the last day of the month. Here is the sample Scala code.

import org.apache.spark.sql.functions._
val df = Seq(
    (java.sql.Date.valueOf("2020-12-31")), (java.sql.Date.valueOf("2021-01-22")), (java.sql.Date.valueOf("2021-02-15"))
).toDF("my_date")
df.show

val df1 = df.withColumn("last_day_prev_month", last_day(add_months($"my_date", -1)))
df1.show

For Python:

from pyspark.sql.functions import last_day, add_months, to_date, col

df = spark.createDataFrame([('2020-12-31', 1), ('2021-01-22', 2), ('2021-02-15', 3)], ['my_date', 'row'])
df.show(5)

df1 = df.withColumn("last_day_prev_month", last_day(add_months(to_date(col("my_date"), "yyyy-MM-dd"), -1)))
df1.show(5)

Here is the sample output from python.

+----------+---+-------------------+
|   my_date|row|last_day_prev_month|
+----------+---+-------------------+
|2020-12-31|  1|         2020-11-30|
|2021-01-22|  2|         2020-12-31|
|2021-02-15|  3|         2021-01-31|
+----------+---+-------------------+
skybutter
  • 96
  • 5