1

I have data with inconsistent timestamp formats, and I am using the function below:

to_timestamp("col","yyyy-MM-dd'T'hh:mm:ss.SSS'Z'") 

Data:

time                      | OUTPUT                         | IDEAL
2022-06-16T07:01:25.346Z  | 2022-06-16T07:01:25.346+0000   | 2022-06-16T07:01:25.346+0000
2022-06-16T06:54:21.51Z   | 2022-06-16T06:54:21.051+0000   | 2022-06-16T06:54:21.510+0000
2022-06-16T06:54:21.5Z    | 2022-06-16T06:54:21.005+0000   | 2022-06-16T06:54:21.500+0000

So I have S, SS, or SSS millisecond formats in the data.
How can I normalise it to SSS the correct way?
Here, .51 should mean 510 milliseconds, not 051.

Using Spark version: 3.2.1. Code:

import pyspark.sql.functions as F
test = spark.createDataFrame([(1,'2022-06-16T07:01:25.346Z'),(2,'2022-06-16T06:54:21.51Z'),(3,'2022-06-16T06:54:21.5Z')],['no','timing1'])
timeFmt = "yyyy-MM-dd'T'hh:mm:ss.SSS'Z'"
test = test.withColumn("timing2", (F.to_timestamp(F.col('timing1'),format=timeFmt)))
test.select("timing1","timing2").show(truncate=False)

Output:

(screenshot; the resulting timing2 values match the OUTPUT column in the table above)

lil-wolf
  • Can you please create a reproducible example? I pasted your formula and to me it returned the "IDEAL" version. Also, which Spark version do you use? – ZygD Jun 22 '22 at 14:24
  • version 3.2.1 @ZygD – lil-wolf Jun 22 '22 at 14:33
  • Using version 3.2.1, I still see no problem: https://i.stack.imgur.com/OhWaR.png Please create an example which people could copy-paste and see the problem themselves. – ZygD Jun 22 '22 at 14:42
  • @ZygD updated. check – lil-wolf Jun 22 '22 at 15:07
  • same as ZygD using the above code: https://i.stack.imgur.com/dohhM.png – Anjaneya Tripathi Jun 22 '22 at 18:22
  • You can try transforming all your time strings to SSS format before you apply to_timestamp. This regex would do it: `F.regexp_replace(F.regexp_replace('time', '(?<=\.)(\d\d)(?=Z)', '$10'), '(?<=\.)(\d)(?=Z)', '$100')` – ZygD Jun 23 '22 at 10:04 (a runnable version is sketched after these comments)
  • I found out the issue: I was using this setting: spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY") I have to turn this off – lil-wolf Jun 23 '22 at 10:26
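
For reference, a runnable sketch of the regex padding ZygD suggests in the comments above, applied to the sample data from the question. It assumes the default (non-legacy) parser policy, and it uses HH for the 24-hour clock rather than the question's hh:

import pyspark.sql.functions as F

# Sample data from the question: the fractional part has 3, 2, or 1 digits.
test = spark.createDataFrame(
    [(1, '2022-06-16T07:01:25.346Z'),
     (2, '2022-06-16T06:54:21.51Z'),
     (3, '2022-06-16T06:54:21.5Z')],
    ['no', 'timing1'])

# Right-pad the fractional seconds to three digits before parsing:
#   .51Z -> .510Z  (two digits get one trailing zero)
#   .5Z  -> .500Z  (one digit gets two trailing zeros)
padded = test.withColumn(
    'timing1_padded',
    F.regexp_replace(
        F.regexp_replace('timing1', r'(?<=\.)(\d\d)(?=Z)', '$10'),
        r'(?<=\.)(\d)(?=Z)', '$100'))

# Every value now matches a fixed SSS pattern.
result = padded.withColumn(
    'timing2',
    F.to_timestamp('timing1_padded', "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"))
result.show(truncate=False)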

2 Answers

1

I also use v3.2.1, and it works for me if you just don't pass a format string; the input is already in a format that to_timestamp parses by default:

from pyspark.sql import functions as F

test = spark.createDataFrame([(1,'2022-06-16T07:01:25.346Z'),(2,'2022-06-16T06:54:21.51Z'),(3,'2022-06-16T06:54:21.5Z')],['no','timing1'])

new_df = test.withColumn('timing1_ts', F.to_timestamp('timing1'))

new_df.show(truncate=False)

new_df.dtypes

+---+------------------------+-----------------------+
|no |timing1                 |timing1_ts             |
+---+------------------------+-----------------------+
|1  |2022-06-16T07:01:25.346Z|2022-06-16 07:01:25.346|
|2  |2022-06-16T06:54:21.51Z |2022-06-16 06:54:21.51 |
|3  |2022-06-16T06:54:21.5Z  |2022-06-16 06:54:21.5  |
+---+------------------------+-----------------------+

Out[9]: [('no', 'bigint'), ('timing1', 'string'), ('timing1_ts', 'timestamp')]
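
If you also want the rendered string to carry a fixed three-digit fraction (like the IDEAL column in the question), one option is to reformat the parsed timestamp with date_format. This is a hedged sketch on the new_df above; timing1_sss is just an illustrative column name:

from pyspark.sql import functions as F

# Format the already-parsed timestamp back to a string with a padded
# 3-digit fraction-of-second.
display_df = new_df.withColumn(
    'timing1_sss',
    F.date_format('timing1_ts', "yyyy-MM-dd'T'HH:mm:ss.SSS"))
display_df.show(truncate=False)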
Luiz Viola
1

I was using this setting :

 spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")

I had to reset it, and now it works as normal.
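
A minimal sketch of how that reset can look, assuming the policy was set on the active session (unset removes the override; CORRECTED explicitly selects the non-legacy Spark 3 parser):

# Check the current value (returns the given default if it is not set).
print(spark.conf.get("spark.sql.legacy.timeParserPolicy", "not set"))

# Either drop the session-level override ...
spark.conf.unset("spark.sql.legacy.timeParserPolicy")

# ... or set it explicitly to the non-legacy behaviour.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "CORRECTED")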

lil-wolf