
I have a timestamp string with microseconds as follows:

+-------------------------+
|Time                     |
+-------------------------+
|22-10-2019 09:41:24.87816|
|22-10-2019 09:41:24.87818|
|22-10-2019 09:41:24.87820|
|22-10-2019 09:41:24.87821|
+-------------------------+

I want to convert it to TimestampType(). For example, "22-10-2019 09:41:24.87816" should be 1571737284.87816.

I've tried this:

df= df.withColumn("timestamp", to_timestamp("Time", format="dd-MM-yyyy HH:mm:ss.SSSSS"))

and this:

df= df.withColumn("timestamp", col("Time").cast(TimestampType()))

but both return nulls. What am I doing wrong?

I could create a UDF with datetime.strptime() but that would be too slow. Shouldn't to_timestamp() just work?

lutybr
  • Possible duplicate of [Convert pyspark string to date format](https://stackoverflow.com/questions/38080748/convert-pyspark-string-to-date-format) – pault Oct 22 '19 at 16:18

1 Answer


SSS in the format pattern handles milliseconds only, and looking at your expected output it seems you want epoch time, so you can use the code below:

from pyspark.sql import functions as F
df.withColumn('unixtimewithmicros', F.concat(F.unix_timestamp('Time', format='dd-MM-yyyy HH:mm:ss'), F.lit('.'), F.split('Time', r'\.')[1]))
Sagar
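
For completeness, here is a minimal, self-contained sketch of how this might look end to end, assuming the goal is a numeric epoch value with microseconds; the epoch_micros column name and the DoubleType cast are illustrative additions, not part of the answer above:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("22-10-2019 09:41:24.87816",), ("22-10-2019 09:41:24.87818",)],
    ["Time"],
)

# unix_timestamp() parses the date/time part down to whole seconds;
# the fractional part is split off the original string and re-attached,
# then the concatenated string is cast to a double.
df = df.withColumn(
    "epoch_micros",
    F.concat(
        F.unix_timestamp("Time", format="dd-MM-yyyy HH:mm:ss"),
        F.lit("."),
        F.split("Time", r"\.")[1],
    ).cast(DoubleType()),
)

df.show(truncate=False)

Note that the exact epoch value depends on the Spark session time zone (spark.sql.session.timeZone), since the input string carries no zone information.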
  • Will that run faster than a UDF? I'm currently using: timestamp_udf = udf(lambda d: datetime.datetime.strptime(d, "%d-%m-%Y %H:%M:%S.%f").timestamp(), StringType()) – lutybr Oct 22 '19 at 15:11
  • Ideally, Spark's built-in functions should run faster than a UDF, as they are executed inside the JVM, which avoids transferring data between the JVM and the Python interpreter. – Sagar Oct 22 '19 at 19:42
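
For comparison, a self-contained version of the UDF approach from the comment above might look like the following. DoubleType is used here because datetime.timestamp() returns a float (the comment declared StringType), and the result depends on the local time zone since the parsed datetime is naive; both points are assumptions on my part, not from the thread:

import datetime

from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

# Parse "dd-MM-yyyy HH:mm:ss" plus fractional seconds and return
# seconds since the epoch as a float (interpreted in the local time zone).
to_epoch = udf(
    lambda d: datetime.datetime.strptime(d, "%d-%m-%Y %H:%M:%S.%f").timestamp(),
    DoubleType(),
)

df = df.withColumn("timestamp", to_epoch("Time"))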