1

I have data with inconsistent timestamp formats, and I am using the function below:

to_timestamp("col","yyyy-MM-dd'T'hh:mm:ss.SSS'Z'") 

Data:

time                      | OUTPUT                         | IDEAL
2022-06-16T07:01:25.346Z  | 2022-06-16T07:01:25.346+0000   | 2022-06-16T07:01:25.346+0000
2022-06-16T06:54:21.51Z   | 2022-06-16T06:54:21.051+0000   | 2022-06-16T06:54:21.510+0000
2022-06-16T06:54:21.5Z    | 2022-06-16T06:54:21.005+0000   | 2022-06-16T06:54:21.500+0000

So I have S, SS, or SSS millisecond formats in the data.
How can I normalise it to SSS the correct way?
Here, .51 should mean 510 milliseconds, not 051.

Using Spark version: 3.2.1. Code:

import pyspark.sql.functions as F
test = spark.createDataFrame([(1,'2022-06-16T07:01:25.346Z'),(2,'2022-06-16T06:54:21.51Z'),(3,'2022-06-16T06:54:21.5Z')],['no','timing1'])
timeFmt = "yyyy-MM-dd'T'hh:mm:ss.SSS'Z'"
test = test.withColumn("timing2", (F.to_timestamp(F.col('timing1'),format=timeFmt)))
test.select("timing1","timing2").show(truncate=False)

Output:

(screenshot; the resulting timing2 values match the OUTPUT column in the table above)

lil-wolf
  • Can you please create a reproducible example? I pasted your formula and to me it returned the "IDEAL" version. Also, which Spark version do you use? – ZygD Jun 22 '22 at 14:24
  • version 3.2.1 @ZygD – lil-wolf Jun 22 '22 at 14:33
  • Using version 3.2.1, I still see no problem: https://i.stack.imgur.com/OhWaR.png Please create an example which people could copy-paste and see the problem themselves. – ZygD Jun 22 '22 at 14:42
  • @ZygD updated. check – lil-wolf Jun 22 '22 at 15:07
  • same as ZygD using the above code: https://i.stack.imgur.com/dohhM.png – Anjaneya Tripathi Jun 22 '22 at 18:22
  • You can try transforming all your time strings to SSS format before you apply to_timestamp. This regex would do it: `F.regexp_replace(F.regexp_replace('time', '(?<=\.)(\d\d)(?=Z)', '$10'), '(?<=\.)(\d)(?=Z)', '$100')` – ZygD Jun 23 '22 at 10:04 (a runnable version is sketched after these comments)
  • I found out the issue: I was using this setting: spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY") I have to turn this off – lil-wolf Jun 23 '22 at 10:26
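
For reference, a runnable sketch of the regex padding ZygD suggests in the comments above, applied to the sample data from the question. It assumes the default (non-legacy) parser policy, and it uses HH for the 24-hour clock rather than the question's hh:

import pyspark.sql.functions as F

# Sample data from the question: the fractional part has 3, 2, or 1 digits.
test = spark.createDataFrame(
    [(1, '2022-06-16T07:01:25.346Z'),
     (2, '2022-06-16T06:54:21.51Z'),
     (3, '2022-06-16T06:54:21.5Z')],
    ['no', 'timing1'])

# Right-pad the fractional seconds to three digits before parsing:
#   .51Z -> .510Z  (two digits get one trailing zero)
#   .5Z  -> .500Z  (one digit gets two trailing zeros)
padded = test.withColumn(
    'timing1_padded',
    F.regexp_replace(
        F.regexp_replace('timing1', r'(?<=\.)(\d\d)(?=Z)', '$10'),
        r'(?<=\.)(\d)(?=Z)', '$100'))

# Every value now matches a fixed SSS pattern.
result = padded.withColumn(
    'timing2',
    F.to_timestamp('timing1_padded', "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"))
result.show(truncate=False)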

2 Answers

1

I also use v3.2.1, and it works for me if you just don't pass a format string; the input is already in a format that to_timestamp parses by default:

from pyspark.sql import functions as F

test = spark.createDataFrame([(1,'2022-06-16T07:01:25.346Z'),(2,'2022-06-16T06:54:21.51Z'),(3,'2022-06-16T06:54:21.5Z')],['no','timing1'])

new_df = test.withColumn('timing1_ts', F.to_timestamp('timing1'))

new_df.show(truncate=False)

new_df.dtypes

+---+------------------------+-----------------------+
|no |timing1                 |timing1_ts             |
+---+------------------------+-----------------------+
|1  |2022-06-16T07:01:25.346Z|2022-06-16 07:01:25.346|
|2  |2022-06-16T06:54:21.51Z |2022-06-16 06:54:21.51 |
|3  |2022-06-16T06:54:21.5Z  |2022-06-16 06:54:21.5  |
+---+------------------------+-----------------------+

Out[9]: [('no', 'bigint'), ('timing1', 'string'), ('timing1_ts', 'timestamp')]
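
If you also want the rendered string to carry a fixed three-digit fraction (like the IDEAL column in the question), one option is to reformat the parsed timestamp with date_format. This is a hedged sketch on the new_df above; timing1_sss is just an illustrative column name:

from pyspark.sql import functions as F

# Format the already-parsed timestamp back to a string with a padded
# 3-digit fraction-of-second.
display_df = new_df.withColumn(
    'timing1_sss',
    F.date_format('timing1_ts', "yyyy-MM-dd'T'HH:mm:ss.SSS"))
display_df.show(truncate=False)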
Luiz Viola
1

I was using this setting :

 spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")

I had to reset it, and now it works as normal.
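
A minimal sketch of how that reset can look, assuming the policy was set on the active session (unset removes the override; CORRECTED explicitly selects the non-legacy Spark 3 parser):

# Check the current value (returns the given default if it is not set).
print(spark.conf.get("spark.sql.legacy.timeParserPolicy", "not set"))

# Either drop the session-level override ...
spark.conf.unset("spark.sql.legacy.timeParserPolicy")

# ... or set it explicitly to the non-legacy behaviour.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "CORRECTED")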

lil-wolf