I have started a spark shell using:
spark-shell --conf spark.sql.session.timeZone=utc
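As far as I know, the same setting can also be applied from inside an already running shell; a minimal equivalent:

// Equivalent to passing --conf on the command line; applies to subsequent queries.
spark.conf.set("spark.sql.session.timeZone", "UTC")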
When I run the example below, the timestamp in the column utc_shifted is shifted: it does NOT contain the desired output of the UDF but something else. To be specific, the input is already UTC and Spark shifts it again (see the output of the reproducible example below). How can this behavior be fixed?
It looks like omitting the time zone setting fixes this, but then I cannot be sure that an executor holding a different/wrong time zone would still give me the correct result, so I prefer to set it explicitly. Why does this setting have no effect on Spark's own timestamp parsing (the fitting column)? And how can I get a similar behavior for my UDF?
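One direction I have considered (only a sketch; pinning the formatter to UTC is my assumption about a fix, not something I have verified across executors) is to stop SimpleDateFormat from falling back to the JVM default time zone:

import java.text.SimpleDateFormat
import java.util.TimeZone

// Pin the parser to UTC so the UDF result no longer depends on the JVM
// default time zone of whichever executor happens to run it.
val format = new SimpleDateFormat("yyyyMMddHHmmssSSS")
format.setTimeZone(TimeZone.getTimeZone("UTC"))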
Reproducible example:
val input = Seq("20191009145901202", "20191009145514816").toDF

import scala.util.{Failure, Success, Try}
import java.sql.Timestamp
import java.text.SimpleDateFormat
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, to_timestamp, udf}

def parseTimestampWithMillis(
    timestampColumnInput: String,
    timestampColumnOutput: String,
    formatString: String)(df: DataFrame): DataFrame = {
  // Parse a raw string into a Timestamp; a fresh SimpleDateFormat per call,
  // since SimpleDateFormat is not thread-safe.
  def getTimestamp(s: String): Option[Timestamp] = {
    if (s.isEmpty) {
      None
    } else {
      val format = new SimpleDateFormat(formatString)
      Try(new Timestamp(format.parse(s).getTime)) match {
        case Success(t) =>
          println(s"input: ${s}, output: ${t}")
          Some(t)
        case Failure(_) => None
      }
    }
  }

  val getTimestampUDF = udf(getTimestamp _)
  df.withColumn(
    timestampColumnOutput, getTimestampUDF(col(timestampColumnInput)))
}
input
  .transform(parseTimestampWithMillis("value", "utc_shifted", "yyyyMMddHHmmssSSS"))
  .withColumn("fitting", to_timestamp(col("value"), "yyyyMMddHHmmssSSS"))
  .show(false)
+-----------------+-----------------------+-------------------+
|value |utc_shifted |fitting |
+-----------------+-----------------------+-------------------+
|20191009145901202|2019-10-09 12:59:01.202|2019-10-09 14:59:01|
|20191009145514816|2019-10-09 12:55:14.816|2019-10-09 14:55:14|
+-----------------+-----------------------+-------------------+
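A quick way to inspect the two time zones involved (a sketch; the JVM default zone is machine-specific, but a UTC+2 zone would explain the 2-hour shift above):

// Session time zone as set via --conf, vs. the JVM default used by SimpleDateFormat.
println(spark.conf.get("spark.sql.session.timeZone")) // utc
println(java.util.TimeZone.getDefault.getID)          // e.g. Europe/Berlin (machine-specific)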
In fact, this setting influences not only how the values are displayed but also the output when writing to a file.
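For example, a minimal write sketch (the output path is just a placeholder) that produces the same shifted values on disk:

input
  .transform(parseTimestampWithMillis("value", "utc_shifted", "yyyyMMddHHmmssSSS"))
  .write
  .mode("overwrite")
  .csv("/tmp/utc_shifted_out") // placeholder path; timestamps are rendered in the session time zone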
Edit
Basically, I explicitly set the time zone as suggested in Spark Structured Streaming automatically converts timestamp to local time, but I get a result that I consider wrong.
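Another option I am aware of (untested here, so only a sketch) is to align the JVM default time zone of the driver and the executors with the session time zone when launching the shell:

spark-shell \
  --conf spark.sql.session.timeZone=UTC \
  --conf spark.driver.extraJavaOptions=-Duser.timezone=UTC \
  --conf spark.executor.extraJavaOptions=-Duser.timezone=UTC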