
I have a dataframe like below:

+------+----------+----+
|ID    |  date    |flag|
+------+----------+----+
|123456|2015-04-21|null|
|234567|2017-04-18|null|
|345678|2009-06-25|null|
|456789|2001-11-07|null|
|567890|2016-10-02|null|
+------+----------+----+

I am trying to modify the dataframe so that the dates in the `date` column show as 'YYYY-mm-01', like below.

+------+----------+----+
|ID    |  date    |flag|
+------+----------+----+
|123456|2015-04-01|null|
|234567|2017-04-01|null|
|345678|2009-06-01|null|
|456789|2001-11-01|null|
|567890|2016-10-01|null|
+------+----------+----+

I am trying to do so like this:

df = df.withColumn("date", f.trunc("date", "month"))

But it looks as if it's messing up the date and making all the dates the same date. How can I change my pyspark column elements from their original YYYY-mm-dd to YYYY-mm-01 for every row?

JMV12
  • What's the data type for the `date` column? Is it a string? (try `df.printSchema()`) – pault Jan 29 '20 at 21:00
  • Looks like they're strings. – JMV12 Jan 29 '20 at 21:55
  • Ok then convert to `DateType` first. [Convert pyspark string to date format](https://stackoverflow.com/questions/38080748/convert-pyspark-string-to-date-format) and then what you tried should work. – pault Jan 29 '20 at 22:00
  • Thank you very much for the clarification! – JMV12 Jan 29 '20 at 22:18

1 Answer


You can use `datetime.replace`. For example, say you have one date string `df.collect()[0]['date']`:

from datetime import datetime

date = datetime.strptime(df.collect()[0]['date'], '%Y-%m-%d')
newdate = date.replace(day=1)
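Applied to the sample dates as plain strings, the same `replace(day=1)` idea looks like this (standalone Python, independent of Spark):

```python
from datetime import datetime

dates = ["2015-04-21", "2017-04-18", "2009-06-25"]

# Parse each string, snap the day to 1, and format back to a string.
firsts = [
    datetime.strptime(s, "%Y-%m-%d").replace(day=1).strftime("%Y-%m-%d")
    for s in dates
]
print(firsts)  # ['2015-04-01', '2017-04-01', '2009-06-01']
```

Note that `%Y` (four-digit year) is required here; `%y` would fail to parse dates like `2015-04-21`.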
Alireza Tajadod