EDIT: spark scala inbuilt udf : to_timestamp() ignores the millisecond part of the timestamp value

Question

Sample Code:

val sparkSession = SparkUtil.getSparkSession("timestamp_format_test")

import sparkSession.implicits._

val format = "yyyy/MM/dd HH:mm:ss.SSS"
val time = "2018/12/21 08:07:36.927"

val df = sparkSession.sparkContext.parallelize(Seq(time)).toDF("in_timestamp")

val df2 = df.withColumn("out_timestamp", to_timestamp(df.col("in_timestamp"), format))

Output:

df2.show(false)

plz notice: out_timestamp loses the milli-second part from the original value

+-----------------------+-------------------+
|in_timestamp           |out_timestamp      |
+-----------------------+-------------------+
|2018/12/21 08:07:36.927|2018-12-21 08:07:36|
+-----------------------+-------------------+

df2.printSchema()

root
 |-- in_timestamp: string (nullable = true)
 |-- out_timestamp: timestamp (nullable = true)

In the above result: in_timestamp is of string type, and I would like to convert to timestamp data type, it does get convert but the millisecond part gets lost. Any idea.? Thanks.!

check https://stackoverflow.com/questions/44886772/how-to-convert-a-string-column-with-milliseconds-to-a-timestamp-with-millisecond — Nayan Sharma, Aug 13 '19 at 21:08
Weird that this hasn't been fixed yet. OP's code works, if he uses the to_timestamp function which doesn't take the format specifier.Since this is also the default format used in spark. — Gforz, Aug 13 '19 at 21:40
@Nayan, those were not helpful and do not match with my scenario, plz suggest some soln.! — TheCodeCache, Aug 14 '19 at 14:07

score 1 · Answer 1 · answered Aug 14 '19 at 06:12

1

Sample code for preserving millisecond during conversion from string to timestamp.

val df2 = df.withColumn("out_timestamp", to_timestamp(df.col("in_timestamp")))

df2.show(false)

+-----------------------+-----------------------+
|in_timestamp           |out_timestamp          |
+-----------------------+-----------------------+
|2018-12-21 08:07:36.927|2018-12-21 08:07:36.927|
+-----------------------+-----------------------+


scala> df2.printSchema
root
 |-- in_timestamp: string (nullable = true)
 |-- out_timestamp: timestamp (nullable = true)

You just need to remove format parameter from to_timestamp. This will save your result with data type timestamp similar to String value.

answered Aug 14 '19 at 06:12

Mahesh Gupta

1,882
12
16

Thanks Mahesh for helpful answer,! but the answer works only in case when the data is in ISO format. Hence I've edited/updated the input data and its format, actually this was my actual problem. plz suggest something – TheCodeCache Aug 14 '19 at 14:05
1

@Manoranjan I am not getting any direct way to convert and preserve Millisecond but there is a way to solve your problem hope it helps you. var df2 = df.withColumn("out_timestamp",to_timestamp(regexp_replace(col("in_timestamp"),"/","-"))) it also solves your purpose please accept the answer other let me know if you have any concern – Mahesh Gupta Aug 16 '19 at 04:56
Thanks for your suggestion, (+1) to the answer and comment, however sorry to say but my question is still open and not fully answered. replacing "/" with "-" is just the temporary fix, what if we have variety of patterns this replace mechanism is not going to work, hence, I have set up the apache spark open source code from github in my local and now I am planning to experiment what is going under the hood around this api, so that I can find why spark behaves like that and propose any correction or fix to this problem. Thanks for you effort, I recognize and appreciate that. : ) happy coding,! – TheCodeCache Aug 17 '19 at 06:57

EDIT: spark scala inbuilt udf : to_timestamp() ignores the millisecond part of the timestamp value

1 Answers1