
I am trying to figure out how to extract a date from a datetime value using PySpark SQL.

The datetime values look like this:

DateTime
2018-05-21T00:00:00.000-04:00
2016-02-22T02:00:02.234-06:00

When I load this into a Spark DataFrame and try to extract the date via

Date() or
Timestamp() and then Date(),

I always get an error saying that a date or timestamp value is expected, but a DateTime value was provided.

Can someone help me retrieve the date from this value? I think you need to provide a timezone for that, but since I already had problems extracting just the date, I wanted to solve that first.

Thank you and kind regards.

bublitz
  • Can you provide a reproducible example with the full error traceback? Please see [how to create good reproducible apache spark dataframe examples](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-dataframe-examples). – pault Aug 16 '18 at 15:45

2 Answers


PySpark has a to_date function to extract the date from a timestamp. In your example, you can create a new column containing just the date as follows:

from pyspark.sql.functions import col, to_date

df = df.withColumn('date_only', to_date(col('date_time')))

If the column you are trying to convert is a string, you can set the format parameter of to_date to specify the datetime format of the string.
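For example, if the column still holds the raw strings from the question, something like this should work. This is a sketch: the pattern is an assumption for values like 2018-05-21T00:00:00.000-04:00, and pattern letters vary by Spark version (XXX matches an offset such as -04:00 on Spark 3.x):

from pyspark.sql.functions import col, to_date

# Sketch: parse the ISO-8601 string directly to a date.
# 'XXX' matches the trailing UTC offset (-04:00) on Spark 3.x patterns.
df = df.withColumn('date_only', to_date(col('date_time'), "yyyy-MM-dd'T'HH:mm:ss.SSSXXX"))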

You can read more about to_date in the documentation here.
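As a quick check, the new column should come back as DateType (assuming date_time was read in as a string):

# date_only should be DateType after the conversion above
df.printSchema()
# root
#  |-- date_time: string (nullable = true)
#  |-- date_only: date (nullable = true)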

vielkind

You can use the date_format, from_unixtime, or to_date function to extract the date from the input string.

Example:

The input DataFrame df is as follows:

#sample dataframe
df=spark.createDataFrame([('2018-05-21T00:00:00.000-04:00',),('2016-02-22T02:00:02.234-06:00',)],['ts'])

#set the session timezone to UTC so offsets are interpreted consistently
spark.sql("set spark.sql.session.timeZone=UTC")

df.show(10,False)
#+-----------------------------+
#|ts                           |
#+-----------------------------+
#|2018-05-21T00:00:00.000-04:00|
#|2016-02-22T02:00:02.234-06:00|
#+-----------------------------+
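The same session timezone can also be set through the runtime config, which is equivalent to the SET statement above:

# equivalent to: spark.sql("set spark.sql.session.timeZone=UTC")
spark.conf.set("spark.sql.session.timeZone", "UTC")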

1. Using date_format() function:

from pyspark.sql.functions import *
df.select(date_format(col('ts'),"yyyy-MM-dd").cast("date").alias('ts')).show(10,False)
#+----------+
#|ts        |
#+----------+
#|2018-05-21|
#|2016-02-22|
#+----------+
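Note that date_format() returns a string, which is why the result is cast back to date; without the cast the column would stay StringType:

# without the cast, date_format() leaves a plain string column
df.select(date_format(col('ts'),"yyyy-MM-dd").alias('ts')).printSchema()
# root
#  |-- ts: string (nullable = true)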

2. Using to_date() function:

df.select(to_date(col('ts')).alias('ts')).show(10,False)
#+----------+
#|ts        |
#+----------+
#|2018-05-21|
#|2016-02-22|
#+----------+
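If you also need the exact instant (including the offset), one option is to parse the full timestamp first and cast afterwards. A sketch, with the pattern as an assumption for the sample values (XXX matches the -04:00 style offset on Spark 3.x):

from pyspark.sql.functions import to_timestamp

#parse to TimestampType first, then take the date part
df.select(to_timestamp(col('ts'),"yyyy-MM-dd'T'HH:mm:ss.SSSXXX").cast("date").alias('ts')).show(10,False)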

3. Using from_unixtime(unix_timestamp()) functions:

df.select(from_unixtime(unix_timestamp(col('ts'),"yyyy-MM-dd'T'HH:mm:ss.SSS"),"yyyy-MM-dd").cast("date").alias("ts")).show(10,False)
#+----------+
#|ts        |
#+----------+
#|2018-05-21|
#|2016-02-22|
#+----------+
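The same logic can also be written as a SQL expression via selectExpr (a sketch of the same option 3); note that unix_timestamp() has seconds precision and interprets the parsed time in the session timezone set above (UTC here):

#SQL flavour of option 3
df.selectExpr("""cast(from_unixtime(unix_timestamp(ts, "yyyy-MM-dd'T'HH:mm:ss.SSS"), 'yyyy-MM-dd') as date) as ts""").show(10,False)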
notNull