
I am using PySpark with Python 2.7. I have a date column stored as a string (with milliseconds) and would like to convert it to a timestamp.

This is what I have tried so far:

df = df.withColumn('end_time', from_unixtime(unix_timestamp(df.end_time, '%Y-%M-%d %H:%m:%S.%f')) )

printSchema() still shows end_time: string (nullable = true), when I expected the type to be timestamp.

– qqplot
    Please include a [mcve] with some small sample inputs and the desired output. [How to create good reproducible spark examples](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-examples). – pault Jan 24 '19 at 02:47

4 Answers


Try using from_utc_timestamp:

from pyspark.sql.functions import from_utc_timestamp

df = df.withColumn('end_time', from_utc_timestamp(df.end_time, 'PST')) 

You'll need to specify a timezone for the function; in this case I chose PST.

If this does not work, please give us an example of a few rows showing df.end_time.
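
For reference, a minimal sketch of how this looks end to end; the sample value is an assumption (the original post never showed example rows), and the session timezone is pinned to UTC so the output is reproducible:

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_utc_timestamp

spark = SparkSession.builder.getOrCreate()
spark.conf.set('spark.sql.session.timeZone', 'UTC')  # pin display timezone for reproducible output

# Hypothetical sample row; the real end_time values were never shown.
df = spark.createDataFrame([('2019-01-23 10:15:30.123',)], ['end_time'])

# Interpret the string as a UTC instant and shift it into PST (UTC-8 in January).
df = df.withColumn('end_time', from_utc_timestamp(df.end_time, 'PST'))
df.show(truncate=False)

+-----------------------+
|end_time               |
+-----------------------+
|2019-01-23 02:15:30.123|
+-----------------------+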

– Tanjin
  • Thank you, this worked, also preserved the milliseconds! – qqplot Jan 25 '19 at 18:56
  • Hi Tanjin, I came across the same problem. I tried your method; it went through and I did get a new column 'start_date' in timestamp format, but the values are all null. Can you help take a look? df1 = df1.withColumn('start_date', f.from_utc_timestamp(df1.start_time, 'PST')) df1.printSchema() df1.select('start_time', 'start_date').show(5) root |-- start_time: string (nullable = true) |-- start_date: timestamp (nullable = true) +-------------+----------+ | start_time|start_date| +-------------+----------+ |1597670747141| null| |1597664804901| null| – Frank Aug 24 '20 at 23:42
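
A note on the nulls in the comment above (this addendum is not part of the original answer): values like 1597670747141 are epoch milliseconds, not a formatted date, so from_utc_timestamp has nothing to parse. One hedged way to convert them, reusing the commenter's column names:

from pyspark.sql import SparkSession
import pyspark.sql.functions as f

spark = SparkSession.builder.getOrCreate()

# Epoch milliseconds stored as strings, as in the comment.
df1 = spark.createDataFrame([('1597670747141',), ('1597664804901',)], ['start_time'])

# Cast to double and divide by 1000 to get epoch seconds, then cast to timestamp.
df1 = df1.withColumn('start_date', (f.col('start_time').cast('double') / 1000).cast('timestamp'))
df1.select('start_time', 'start_date').show(truncate=False)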

Create a sample dataframe with the timestamp formatted as a string:

import pyspark.sql.functions as F
df = spark.createDataFrame([('22-Jul-2018 04:21:18.792 UTC', ),('23-Jul-2018 04:21:25.888 UTC',)], ['TIME'])
df.show(2,False)
df.printSchema()

Output:

+----------------------------+
|TIME                        |
+----------------------------+
|22-Jul-2018 04:21:18.792 UTC|
|23-Jul-2018 04:21:25.888 UTC|
+----------------------------+
root
|-- TIME: string (nullable = true)

Convert the string time format (including milliseconds) to a unix timestamp (double). Since unix_timestamp() excludes milliseconds, we need to add them back with a simple hack: extract the milliseconds from the string with substring (start_position = -7, length_of_substring = 3), cast the substring to float, and add it (divided by 1000) to the unix timestamp.

# Parse to whole seconds, then add back the milliseconds taken from the string
# (the last 7 chars are e.g. "792 UTC"; the first 3 of those are the milliseconds).
df1 = df.withColumn("unix_timestamp",
                    F.unix_timestamp(df.TIME, 'dd-MMM-yyyy HH:mm:ss.SSS z')
                    + F.substring(df.TIME, -7, 3).cast('float') / 1000)

Convert the unix_timestamp (double) to the timestamp datatype in Spark:

df2 = df1.withColumn("TimestampType",F.to_timestamp(df1["unix_timestamp"]))
df2.show(n=2,truncate=False)

This will give you the following output:

+----------------------------+----------------+-----------------------+
|TIME                        |unix_timestamp  |TimestampType          |
+----------------------------+----------------+-----------------------+
|22-Jul-2018 04:21:18.792 UTC|1.532233278792E9|2018-07-22 04:21:18.792|
|23-Jul-2018 04:21:25.888 UTC|1.532319685888E9|2018-07-23 04:21:25.888|
+----------------------------+----------------+-----------------------+

Checking the Schema:

df2.printSchema()


root
 |-- TIME: string (nullable = true)
 |-- unix_timestamp: double (nullable = true)
 |-- TimestampType: timestamp (nullable = true)
– Sangram Gaikwad

In current versions of Spark, we do not have to do much with respect to timestamp conversion.

Using the to_timestamp function works well in this case. The only thing we need to take care of is passing the timestamp format that matches the original column. In my case it was yyyy-MM-dd HH:mm:ss; other formats could be MM/dd/yyyy HH:mm:ss or a combination of such.

from pyspark.sql.functions import to_timestamp
df = df.withColumn('date_time', to_timestamp('event_time', 'yyyy-MM-dd HH:mm:ss'))
df.show()
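
The pattern above has no fractional-seconds part, so it matches the asker's data only up to whole seconds. As a hedged addition (not from the original answer): when the strings are already in the default yyyy-MM-dd HH:mm:ss.SSS layout, a plain cast to timestamp also works and keeps the milliseconds. The sample value below is made up:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample already in the default timestamp layout.
df = spark.createDataFrame([('2019-01-24 02:47:00.123',)], ['event_time'])

# Strings in this layout cast directly to timestamp, preserving fractional seconds.
df = df.withColumn('date_time', F.col('event_time').cast('timestamp'))
df.show(truncate=False)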

The following might help:

from pyspark.sql import functions as F
df = df.withColumn("end_time", F.from_unixtime(F.col("end_time"), 'yyyy-MM-dd HH:mm:ss.SS').cast("timestamp"))


– Hari Baskar
  • `func.col` is not defined. – Sangram Gaikwad Jan 24 '19 at 11:58
  • 2
    `df = SparkSQLContext.createDataFrame([("2018-02-11 04:21:25.12", ), ("2018-02-12 04:22:26.13", )], ["end_time"])` Does not work on this dataframe, Gives null values in end_time column – Sangram Gaikwad Jan 24 '19 at 13:39
  • I'm trying to convert `Wed Oct 19 00:15:13 EST 2022` to a timestamp. I tried all of the above logic, but I get some time difference from the actual time. Any help would be appreciated. – Anos Oct 27 '22 at 09:05