
I am using Spark 2.1 with Scala.

How can I convert a string column with milliseconds into a timestamp with milliseconds?

I tried the following code from the question *Better way to convert a string field into timestamp in Spark*:

import org.apache.spark.sql.functions.unix_timestamp
val tdf = Seq((1L, "05/26/2016 01:01:01.601"), (2L, "#$@#@#")).toDF("id", "dts")
val tts = unix_timestamp($"dts", "MM/dd/yyyy HH:mm:ss.SSS").cast("timestamp")
tdf.withColumn("ts", tts).show(2, false)

But I get the result without milliseconds:

+---+-----------------------+---------------------+
|id |dts                    |ts                   |
+---+-----------------------+---------------------+
|1  |05/26/2016 01:01:01.601|2016-05-26 01:01:01.0|
|2  |#$@#@#                 |null                 |
+---+-----------------------+---------------------+
  • Newer versions of Spark have fixed this. [How to understand to_timestamp for conversion from String to Timestamp in Spark Scala?](https://stackoverflow.com/q/57948790/6926749) – timothyzhang Sep 18 '19 at 17:17
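
Building on that comment: `unix_timestamp` returns whole seconds as a `bigint`, which is why the fraction is always dropped. A minimal sketch, assuming Spark 3.0+ (where timestamp parsing moved from SimpleDateFormat to java.time.DateTimeFormatter), using `to_timestamp` to keep the milliseconds:

import org.apache.spark.sql.functions.to_timestamp

// assumes a spark-shell session, so spark.implicits._ is in scope
val tdf = Seq((1L, "05/26/2016 01:01:01.601"), (2L, "#$@#@#")).toDF("id", "dts")
// to_timestamp parses the .SSS fraction instead of truncating it
tdf.withColumn("ts", to_timestamp($"dts", "MM/dd/yyyy HH:mm:ss.SSS")).show(false)
// id 1 parses to 2016-05-26 01:01:01.601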

3 Answers


A UDF with SimpleDateFormat works. The idea is taken from the UDF logic that Ram Ghadiyaram linked to.

import java.text.SimpleDateFormat
import java.sql.Timestamp
import org.apache.spark.sql.functions.udf
import scala.util.{Try, Success, Failure}

val getTimestamp: String => Option[Timestamp] = s => s match {
  case "" => None
  case _ =>
    // SimpleDateFormat is not thread-safe, so it is created per invocation here
    val format = new SimpleDateFormat("MM/dd/yyyy HH:mm:ss.SSS")
    Try(new Timestamp(format.parse(s).getTime)) match {
      case Success(t) => Some(t)
      case Failure(_) => None // unparseable input becomes null in the column
    }
}

val getTimestampUDF = udf(getTimestamp)
val tdf = Seq((1L, "05/26/2016 01:01:01.601"), (2L, "#$@#@#")).toDF("id", "dts")
val tts = getTimestampUDF($"dts")
tdf.withColumn("ts", tts).show(2, false)

with output:

+---+-----------------------+-----------------------+
|id |dts                    |ts                     |
+---+-----------------------+-----------------------+
|1  |05/26/2016 01:01:01.601|2016-05-26 01:01:01.601|
|2  |#$@#@#                 |null                   |
+---+-----------------------+-----------------------+
  • Awesome! Unfortunately I don't have any test environment for Spark :-) – Ram Ghadiyaram Jul 03 '17 at 16:02
  • @RamGhadiyaram - How is the performance of this UDF? Does it create a new SimpleDateFormat for every call to the UDF? – Remis Haroon - رامز Mar 11 '20 at 10:45
  • @RemisHaroon The first choice should be built-in functions such as `date_format`, which use SimpleDateFormat internally (all the optimization is done in Spark's Scala code), but in this case they did not give the original poster the correct output, so a UDF was used instead. Creating the SimpleDateFormat per call has a minor impact; if you prefer one object instead of many, feel free to move it outside the UDF and test the change. – Ram Ghadiyaram Mar 11 '20 at 20:34
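
Following up on that exchange, a minimal sketch of hoisting the formatter out of the per-row path; it swaps SimpleDateFormat for java.time.DateTimeFormatter, which is immutable and thread-safe, so a single shared instance is safe:

import java.sql.Timestamp
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import org.apache.spark.sql.functions.udf
import scala.util.Try

// one shared, thread-safe formatter instead of one SimpleDateFormat per row
val fmt = DateTimeFormatter.ofPattern("MM/dd/yyyy HH:mm:ss.SSS")

val getTimestampUDF = udf { s: String =>
  // Try absorbs both null input and parse failures, yielding null in the column
  Try(Timestamp.valueOf(LocalDateTime.parse(s, fmt))).toOption
}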

There is an easier way than writing a UDF: just parse the millisecond part and add it to the unix timestamp. The following code is PySpark and should be very close to the Scala equivalent (a sketch of which follows the result below):

from pyspark.sql.functions import unix_timestamp, substring

timeFmt = "yyyy/MM/dd HH:mm:ss.SSS"
df = df.withColumn('ux_t', unix_timestamp(df.t, format=timeFmt) + substring(df.t, -3, 3).cast('float') / 1000)

Result: `'2017/03/05 14:02:41.865'` is converted to `1488722561.865`.
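
For reference, a sketch of the Scala equivalent, assuming the same shape of DataFrame as the PySpark snippet (a string column named t):

import org.apache.spark.sql.functions.{col, substring, unix_timestamp}

val timeFmt = "yyyy/MM/dd HH:mm:ss.SSS"
// whole seconds from unix_timestamp, plus the last three characters as millis
val df2 = df.withColumn("ux_t",
  unix_timestamp(col("t"), timeFmt) + substring(col("t"), -3, 3).cast("float") / 1000)
// cast back to a timestamp type if needed:
// df2.withColumn("ts", col("ux_t").cast("timestamp"))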

import org.apache.spark.sql.types.DataTypes;

// Casting a numeric column to TimestampType treats the value as *seconds*
// since the epoch, so epoch milliseconds must be divided by 1000 first;
// going through DoubleType keeps the fractional (millisecond) part.
dataFrame.withColumn(
    "time_stamp",
    dataFrame.col("milliseconds_in_string")
        .cast(DataTypes.DoubleType)
        .divide(1000)
        .cast(DataTypes.TimestampType)
);

The code is in Java and is easy to convert to Scala; a sketch of the conversion follows.
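
A sketch of that Scala conversion, under the same assumption that milliseconds_in_string holds epoch milliseconds as a string:

import org.apache.spark.sql.functions.col

val withTs = dataFrame.withColumn(
  "time_stamp",
  (col("milliseconds_in_string").cast("double") / 1000).cast("timestamp")
)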
