
I want to convert the timestamp column, which contains epoch time, into a human-readable datetime. from_unixtime is not giving me the correct date and time. Please help.

df = spark.createDataFrame([('1535934855077532656',), ('1535934855077532656',),('1535935539886503614',)], ['timestamp',])

df.show()
+-------------------+
|          timestamp|
+-------------------+
|1535934855077532656|
|1535934855077532656|
|1535935539886503614|
+-------------------+
df.withColumn('datetime',from_unixtime(df.timestamp,"yyyy-MM-dd HH:mm:ss:SSS")).select(['timestamp','datetime']).show(15,False)
+-------------------+----------------------------+
|timestamp          |datetime                    |
+-------------------+----------------------------+
|1535934855077532656|153853867-12-24 10:24:31:872|
|1535934855077532656|153853867-12-24 10:24:31:872|
|1535935539886503614|153875568-09-17 05:33:49:872|
+-------------------+----------------------------+

1 Answer


from_unixtime

Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the given format.

Your data is clearly not expressed in seconds. It looks like nanoseconds, so divide by 10**9 (1000 ** 3) to get seconds before converting:

from pyspark.sql.functions import col, from_unixtime

# The values are nanoseconds since the epoch, so divide by 1000 ** 3
# to get seconds before passing them to from_unixtime.
df.withColumn(
    'datetime',
    from_unixtime(df.timestamp / 1000 ** 3, "yyyy-MM-dd HH:mm:ss:SSS")
).show(truncate=False)

# +-------------------+-----------------------+
# |timestamp          |datetime               |
# +-------------------+-----------------------+
# |1535934855077532656|2018-09-03 02:34:15:000|
# |1535934855077532656|2018-09-03 02:34:15:000|
# |1535935539886503614|2018-09-03 02:45:39:000|
# +-------------------+-----------------------+
  • Thank you for answering. Sorry, I am new to PySpark. Yes, my timestamp is in nanoseconds. Can we get nanosecond precision using from_unixtime? Secondly, how do I get the time in a specific timezone, e.g. JST (Japan Time)? Thank you – Sun Nov 29 '18 at 11:53
  • @Sun You can get at most millisecond output with `date_format((df.timestamp / 1000.0 ** 3).cast("timestamp"), "yyyy-MM-dd HH:mm:ss:SSS")`, and the time zone is determined through configuration ([Spark Structured Streaming automatically converts timestamp to local time](https://stackoverflow.com/q/48767008/6910411)). – zero323 Nov 29 '18 at 13:28
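As a follow-up to the time zone question in the comments, here is a minimal sketch that combines the millisecond-precision approach from the last comment with the session time zone setting it links to. The "Asia/Tokyo" zone ID is only an illustrative assumption for JST; it is not part of the original answer.

from pyspark.sql.functions import date_format

# Assumption: render timestamps in Japan Standard Time by setting the
# session time zone; "Asia/Tokyo" is an example zone ID, not from the thread.
spark.conf.set("spark.sql.session.timeZone", "Asia/Tokyo")

df.withColumn(
    'datetime',
    # Cast seconds-since-epoch to a timestamp, then format it.
    # The "SSS" pattern caps the output at millisecond precision,
    # so the nanosecond part of the original value is lost.
    date_format((df.timestamp / 1000.0 ** 3).cast("timestamp"),
                "yyyy-MM-dd HH:mm:ss:SSS")
).show(truncate=False)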