20

I've seen (here: How to convert Timestamp to Date format in DataFrame?) how to convert a timestamp into a date type, but, at least for me, it doesn't work.

Here is what I've tried:

# Imports implied by the func / stypes aliases used below
import pyspark.sql.functions as func
import pyspark.sql.types as stypes

# Create dataframe
df_test = spark.createDataFrame([('20170809',), ('20171007',)], ['date',])

# Convert to timestamp
df_test2 = df_test.withColumn('timestamp', func.when((df_test.date.isNull() | (df_test.date == '')), '0')
    .otherwise(func.unix_timestamp(df_test.date, 'yyyyMMdd')))

# Convert timestamp to date again
df_test2.withColumn('date_again', df_test2['timestamp'].cast(stypes.DateType())).show()

But this returns null in the column date_again:

+--------+----------+----------+
|    date| timestamp|date_again|
+--------+----------+----------+
|20170809|1502229600|      null|
|20171007|1507327200|      null|
+--------+----------+----------+

Any idea of what's failing?

ZygD
  • 22,092
  • 39
  • 79
  • 102
Luis A.G.
  • 1,017
  • 2
  • 15
  • 23
  • I would go with `df_test2.withColumn('date_again', from_unixtime('timestamp', 'yyyyMMdd')).show()` (or whatever format you prefer). Probably some type of a dupe of this https://stackoverflow.com/questions/38080748/convert-pyspark-string-to-date-format – David Arenburg Aug 31 '17 at 09:08
  • I have unix timestamps of varying lengths: 11, 12, 13, 15, and sometimes 16 digits. I just need the hours-mins-secs-days-month-year information to be extracted from the unix timestamps, and not the milliseconds info. How do I process the unix timestamps to achieve my goal? – n0obcoder Apr 14 '21 at 12:11

8 Answers

24

The following:

func.when((df_test.date.isNull() | (df_test.date == '')) , '0')\
  .otherwise(func.unix_timestamp(df_test.date,'yyyyMMdd'))

doesn't work because it is type-inconsistent: the first branch returns a string while the second branch returns a bigint. As a result it always returns NULL when the data is NOT NULL and not empty.

It is also redundant - the SQL functions are NULL-safe and handle malformed input safely. There is no need for additional checks.

In [1]: spark.sql("SELECT unix_timestamp(NULL, 'yyyyMMdd')").show()
+----------------------------------------------+
|unix_timestamp(CAST(NULL AS STRING), yyyyMMdd)|
+----------------------------------------------+
|                                          null|
+----------------------------------------------+


In [2]: spark.sql("SELECT unix_timestamp('', 'yyyyMMdd')").show()
+--------------------------+
|unix_timestamp(, yyyyMMdd)|
+--------------------------+
|                      null|
+--------------------------+
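
The guard can therefore simply be dropped. A minimal sketch of that simplification, reusing the question's func alias (not the answer author's exact code):

# unix_timestamp already yields NULL for NULL or unparseable input
df_test2 = df_test.withColumn('timestamp', func.unix_timestamp(df_test.date, 'yyyyMMdd'))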

And you don't need an intermediate step in Spark 2.2 or later:

from pyspark.sql.functions import to_date

to_date("date", "yyyyMMdd")
Alper t. Turker
  • 34,230
  • 9
  • 83
  • 115
  • Thanks for the explanation, I updated recently to 2.2 and wasn't aware about to_date. – Luis A.G. Aug 31 '17 at 10:15
  • 1
    `"date"` should be `col("date")`, otherwise it is casting a string reproducing null. Therefore needs to import also `col` from the `functions`. The format of the date is an [optional parameter](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=date#pyspark.sql.functions.to_date). – Matteo Guarnerio Aug 15 '19 at 07:38
18

You should be doing the following:

>>> df_test2.withColumn('date_again', func.from_unixtime('timestamp').cast(DateType())).show()
+--------+----------+----------+
|    date| timestamp|date_again|
+--------+----------+----------+
|20170809|1502216100|2017-08-09|
|20171007|1507313700|2017-10-07|
+--------+----------+----------+

and the schema is:

>>> df_test2.withColumn('date_again', func.from_unixtime('timestamp').cast(DateType())).printSchema()
root
 |-- date: string (nullable = true)
 |-- timestamp: string (nullable = true)
 |-- date_again: date (nullable = true)
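
For reference, the snippet above assumes imports along these lines (not shown in the original answer):

import pyspark.sql.functions as func
from pyspark.sql.types import DateType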
Ramesh Maharjan
  • 41,071
  • 6
  • 69
  • 97
  • Thanks for the answer, the problem was the type inconsistency that mentions @user8371915 in his comment, applying again `from_unixtime` solves it – Luis A.G. Aug 31 '17 at 10:17
8

For pyspark:

Assume you have a field named 'DateTime' that shows the date as a date and a time.

Add a new field to your df that shows a 'DateOnly' column as follows:

from pyspark.sql.functions import date_format

df.withColumn("DateOnly", date_format('DateTime', "yyyyMMdd")).show()

This will show a new column in the df called DateOnly, with the date in yyyyMMdd form.
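
Note that date_format returns a string column. If an actual DateType column is needed instead, a to_date-based variant could be used (a sketch, assuming the same hypothetical 'DateTime' column):

from pyspark.sql.functions import to_date

# to_date drops the time part and yields a DateType column
df.withColumn("DateOnly", to_date('DateTime')).printSchema()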

Grant Shannon
  • 4,709
  • 1
  • 46
  • 36
3

To convert a unix_timestamp column (called TIMESTMP) in a pyspark dataframe (df) to a Date type:

Below is a two-step process (there may be a shorter way; see the sketch after the code):

  • convert from UNIX timestamp to timestamp
  • convert from timestamp to Date

Initially df.printSchema() shows: -- TIMESTMP: long (nullable = true)

Use Spark SQL to implement the conversion as follows:

df.registerTempTable("dfTbl")

dfNew= spark.sql("""
                     SELECT *, cast(TIMESTMP as Timestamp) as newTIMESTMP 
                     FROM dfTbl d
                  """)

dfNew.printSchema()

The printSchema() will now show:

-- newTIMESTMP: timestamp (nullable = true)

Finally, convert the type from timestamp to Date as follows:

from pyspark.sql.types import DateType
dfNew=dfNew.withColumn('actual_date', dfNew['newTIMESTMP'].cast(DateType()))
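
As a possible shorter way (a sketch, not the answer author's code), the two casts can also be chained directly on the column; on Spark 2.0+, createOrReplaceTempView also replaces the deprecated registerTempTable:

from pyspark.sql import functions as F

# long unix seconds -> timestamp -> date in a single withColumn
dfNew = df.withColumn('actual_date', F.col('TIMESTMP').cast('timestamp').cast('date'))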
Grant Shannon
  • 4,709
  • 1
  • 46
  • 36
1
import datetime
from pyspark.sql.functions import udf

# udf to convert the ts (in milliseconds) to a timestamp string
get_timestamp = udf(lambda x: datetime.datetime.fromtimestamp(x / 1000.0).strftime("%Y-%m-%d %H:%M:%S"))

# apply this udf in the dataframe with your timestamp
df_withdate = df.withColumn("datetime", get_timestamp(df.ts))
eyllanesc
  • 235,170
  • 19
  • 170
  • 241
anidev711
  • 232
  • 4
  • 15
  • Although it may seem alright and work, it is best to avoid using UDFs as long as a spark-native solution exists. This would become a bottleneck when applied to a large data set. – Reza Keshavarz Oct 01 '22 at 11:41
  • 1
    Sure, we can simply add the logic in the withColumn code as well. – anidev711 Oct 21 '22 at 06:18
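
A minimal sketch of that UDF-free approach (assuming, as in the answer above, that ts holds milliseconds):

from pyspark.sql import functions as F

# from_unixtime expects seconds, so divide the millisecond value by 1000;
# the default output format is "yyyy-MM-dd HH:mm:ss", matching the udf above
df_withdate = df.withColumn("datetime", F.from_unixtime(F.col("ts") / 1000))
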
0

They closed my question as a duplicate of this one, so I'll copy and paste my answer here (it is a duplicate, right?).

As the timestamp column is in milliseconds, it is only necessary to convert it into seconds and cast it into TimestampType, and that should do the trick:

from pyspark.sql.types import TimestampType
import pyspark.sql.functions as F

df.select( 
      (F.col("my_timestamp") / 1000).cast(TimestampType())
)
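
If the converted column should keep a readable name or stay on the original DataFrame, a withColumn variant of the same idea works too (a sketch; the new column name is just illustrative):

from pyspark.sql.types import TimestampType
import pyspark.sql.functions as F

# same millisecond-to-TimestampType conversion, kept as a named column
df = df.withColumn("my_timestamp_ts", (F.col("my_timestamp") / 1000).cast(TimestampType()))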
Vzzarr
  • 4,600
  • 2
  • 43
  • 80
0

An option without importing TimestampType:

import pyspark.sql.functions as F

F.from_unixtime(F.col('date_col') / 1000).cast('date')
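
For example (a sketch, assuming date_col holds unix milliseconds, as the division by 1000 implies):

import pyspark.sql.functions as F

df.withColumn('date_col_as_date', F.from_unixtime(F.col('date_col') / 1000).cast('date')).show()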
ZygD
  • 22,092
  • 39
  • 79
  • 102
0

You can directly cast the field:

from pyspark.sql.functions import col
from pyspark.sql.types import TimestampType

df = df.withColumn('datetime', col('timestamp').cast(TimestampType()))
MrBigData
  • 3
  • 6