
I have started a spark shell using:

spark-shell --conf spark.sql.session.timeZone=utc

When running the example below, the result in the column utc_shifted is a timestamp that has been shifted. It does NOT contain the desired output of the UDF but something else. To be specific: the input is already UTC, and Spark shifts it a second time. How can this behavior be fixed?

+-----------------+-----------------------+-------------------+
|value            |utc_shifted            |fitting            |
+-----------------+-----------------------+-------------------+
|20191009145901202|2019-10-09 12:59:01.202|2019-10-09 14:59:01|
|20191009145514816|2019-10-09 12:55:14.816|2019-10-09 14:55:14|
+-----------------+-----------------------+-------------------+

It looks like not passing the time zone setting at all would fix this, but then I cannot be sure that an executor with a different/wrong time zone would still produce the correct result, so I prefer to set it explicitly. Why does this setting have no effect on Spark's own timestamp parsing? How can I get similar behavior for my UDF?
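My suspicion (not confirmed by the docs) is that spark.sql.session.timeZone only governs Spark SQL's own conversions such as to_timestamp, while a SimpleDateFormat inside a UDF falls back to the JVM default time zone. A quick plain-JVM check of that suspicion:

import java.text.SimpleDateFormat
import java.util.TimeZone

// SimpleDateFormat parses in the JVM default time zone unless told
// otherwise; spark.sql.session.timeZone never reaches it.
val fmt = new SimpleDateFormat("yyyyMMddHHmmssSSS")
println(fmt.getTimeZone.getID)          // JVM default, e.g. Europe/Vienna
println(fmt.parse("20191009145901202")) // wall clock interpreted in that zone

fmt.setTimeZone(TimeZone.getTimeZone("UTC"))
println(fmt.parse("20191009145901202")) // now interpreted as UTC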

Reproducible example:

val input = Seq("20191009145901202", "20191009145514816").toDF

import scala.util.{Failure, Success, Try}
import java.sql.Timestamp
import java.text.SimpleDateFormat
import org.apache.spark.sql.DataFrame
// pre-imported by spark-shell, but needed in compiled code:
import org.apache.spark.sql.functions.{col, to_timestamp, udf}

def parseTimestampWithMillis(
      timestampColumnInput: String,
      timestampColumnOutput: String,
      formatString: String)(df: DataFrame): DataFrame = {
    def getTimestamp(s: String): Option[Timestamp] = {
      if (s.isEmpty) {
        None
      } else {
        // SimpleDateFormat is not thread-safe, so build a fresh one per call;
        // without an explicit time zone it parses in the JVM default zone
        val format = new SimpleDateFormat(formatString)
        Try(new Timestamp(format.parse(s).getTime)) match {
          case Success(t) => {
            println(s"input: ${s}, output: ${t}")
            Some(t)
          }
          case Failure(_) => None
        }
      }
    }

    val getTimestampUDF = udf(getTimestamp _)
    df.withColumn(
      timestampColumnOutput, getTimestampUDF(col(timestampColumnInput)))
  }


input
  .transform(parseTimestampWithMillis("value", "utc_shifted", "yyyyMMddHHmmssSSS"))
  .withColumn("fitting", to_timestamp(col("value"), "yyyyMMddHHmmssSSS"))
  .show(false)

+-----------------+-----------------------+-------------------+
|value            |utc_shifted            |fitting            |
+-----------------+-----------------------+-------------------+
|20191009145901202|2019-10-09 12:59:01.202|2019-10-09 14:59:01|
|20191009145514816|2019-10-09 12:55:14.816|2019-10-09 14:55:14|
+-----------------+-----------------------+-------------------+

In fact, this setting influences not only the display but also the output when writing to a file.
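A quick way to see that (the output path is just illustrative):

// The shifted wall-clock values end up in the CSV as well, so this is
// not merely how show() renders the column.
input
  .transform(parseTimestampWithMillis("value", "utc_shifted", "yyyyMMddHHmmssSSS"))
  .write.mode("overwrite").csv("/tmp/utc_shifted_check")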

Edit

Basically, I explicitly set the time zone as suggested in Spark Structured Streaming automatically converts timestamp to local time, but I get a result that I consider wrong.


1 Answer

spark-shell --conf spark.sql.session.timeZone=UTC --conf "spark.driver.extraJavaOptions=-Duser.timezone=UTC" --conf "spark.executor.extraJavaOptions=-Duser.timezone=UTC"

seems to give me the desired result.

But it does not explain why this is necessary for my UDF and not for Spark's internal to_timestamp function.
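An alternative that needs no JVM flags at all (a sketch of my own, not tested beyond the example above) is to pin the time zone inside the UDF itself, so the parse is UTC regardless of driver or executor settings:

import java.sql.Timestamp
import java.text.SimpleDateFormat
import java.util.TimeZone
import org.apache.spark.sql.functions.udf

// Fix the formatter to UTC inside the UDF instead of via JVM options.
val parseUtcMillis = udf { s: String =>
  if (s == null || s.isEmpty) None
  else {
    val format = new SimpleDateFormat("yyyyMMddHHmmssSSS")
    format.setTimeZone(TimeZone.getTimeZone("UTC")) // parse as UTC, always
    scala.util.Try(new Timestamp(format.parse(s).getTime)).toOption
  }
}

Used as df.withColumn("utc_ok", parseUtcMillis(col("value"))), this should match the fitting column without touching user.timezone on any executor.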
