Although two very good answers have been provided, I found them both to be a bit of a heavy hammer to solve the problem. I did not want anything that would require modifying time zone parsing behavior across the whole app, or an approach that would alter the default time zone of my JVM. I did find a solution after much pain, which I will share below...
Parsing time[/date] strings into timestamps for date manipulations, then correctly rendering the result back
First, let's address the issue of how to get Spark SQL to correctly parse a date[/time] string (given a format) into a timetamp and then properly render that timestamp back out so it shows the same date[/time] as the original string input. The general approach is:
- convert a date[/time] string to time stamp [via to_timestamp]
[ to_timestamp seems to assume the date[/time] string represents a time relative to UTC (GMT time zone) ]
- relativize that timestamp to the timezone we are in via from_utc_timestamp
The test code below implements this approach. 'timezone we are in' is passed as the first argument to the timeTricks method. The code converts the input string "1970-01-01" to localizedTimeStamp (via from_utc_timestamp) and verifies that the 'valueOf' of that time stamp is the same as "1970-01-01 00:00:00".
object TimeTravails {
def main(args: Array[String]): Unit = {
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
val spark: SparkSession = SparkSession.builder()
.master("local[3]")
.appName("SparkByExample")
.getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
import spark.implicits._
import java.sql.Timestamp
def timeTricks(timezone: String): Unit = {
val df2 = List("1970-01-01").toDF("timestr"). // can use to_timestamp even without time parts !
withColumn("timestamp", to_timestamp('timestr, "yyyy-MM-dd")).
withColumn("localizedTimestamp", from_utc_timestamp('timestamp, timezone)).
withColumn("weekday", date_format($"localizedTimestamp", "EEEE"))
val row = df2.first()
println("with timezone: " + timezone)
df2.show()
val (timestamp, weekday) = (row.getAs[Timestamp]("localizedTimestamp"), row.getAs[String]("weekday"))
timezone match {
case "UTC" =>
assert(timestamp == Timestamp.valueOf("1970-01-01 00:00:00") && weekday == "Thursday")
case "PST" | "GMT-8" | "America/Los_Angeles" =>
assert(timestamp == Timestamp.valueOf("1969-12-31 16:00:00") && weekday == "Wednesday")
case "Asia/Tokyo" =>
assert(timestamp == Timestamp.valueOf("1970-01-01 09:00:00") && weekday == "Thursday")
}
}
timeTricks("UTC")
timeTricks("PST")
timeTricks("GMT-8")
timeTricks("Asia/Tokyo")
timeTricks("America/Los_Angeles")
}
}
Solution to problem of Structured Streaming Interpreting incoming date[/time] strings as UTC (not local time)
The code below illustrates how to apply the above tricks (with a slight modification) so as to correct the problem of timestamps being shifted by the offset between local time and GMT.
object Struct {
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
def main(args: Array[String]): Unit = {
val timezone = "PST"
val spark: SparkSession = SparkSession.builder()
.master("local[3]")
.appName("SparkByExample")
.getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
val df = spark.readStream
.format("socket")
.option("host", "localhost")
.option("port", "9999")
.load()
import spark.implicits._
val splitDf = df.select(split(df("value"), " ").as("arr")).
select($"arr" (0).as("tsString"), $"arr" (1).as("count")).
withColumn("timestamp", to_timestamp($"tsString", "yyyy-MM-dd"))
val grouped = splitDf.groupBy(window($"timestamp", "1 day", "1 day").as("date_window")).count()
val tunedForDisplay =
grouped.
withColumn("windowStart", to_utc_timestamp($"date_window.start", timezone)).
withColumn("windowEnd", to_utc_timestamp($"date_window.end", timezone))
tunedForDisplay.writeStream
.format("console")
.outputMode("update")
.option("truncate", false)
.start()
.awaitTermination()
}
}
The code requires input be fed via socket... I use the program 'nc' (net cat) started like this:
nc -l 9999
Then I start the Spark program and provide net cat with one line of input:
1970-01-01 4
The output I get illustrates the problem with the offset shift:
-------------------------------------------
Batch: 1
-------------------------------------------
+------------------------------------------+-----+-------------------+-------------------+
|date_window |count|windowStart |windowEnd |
+------------------------------------------+-----+-------------------+-------------------+
|[1969-12-31 16:00:00, 1970-01-01 16:00:00]|1 |1970-01-01 00:00:00|1970-01-02 00:00:00|
+------------------------------------------+-----+-------------------+-------------------+
Note that the start and end for date_window is shifted by eight hours from the input (because I am in the GMT-7/8 timezone, PST). However, I correct this shift using to_utc_timestamp to get the proper start and end date times for the one day window that subsumes the input: 1970-01-01 00:00:00,1970-01-02 00:00:00.
Note that in the first block of code presented we used from_utc_timestamp, whereas for the structured streaming solution we used to_utc_timestamp. I have yet to figure out which of these two to use in a given situation. (Please clue me in if you know!).