
I am trying to parse dates in the Paris timezone (UTC+2), and PySpark drops the offset when converting from string to timestamp:

import pyspark.sql.functions as F

df_times = spark.createDataFrame([('2020-12-31T06:53:21.000+02:00',)], ["t"])
# df_times:pyspark.sql.dataframe.DataFrame
#    t:string
df_timestamp = df_times.select(F.to_timestamp(F.col("t")).alias('to_timestamp'))
# df_timestamp:pyspark.sql.dataframe.DataFrame
#    to_timestamp:timestamp
df_timestamp.show()

+-------------------+
|       to_timestamp|
+-------------------+
|2020-12-31 04:53:21|
+-------------------+

Why does PySpark display 2020-12-31 04:53:21 instead of 2020-12-31 06:53:21+02:00?

It's especially frustrating when I try to retrieve the hour:

df_timestamp.select(F.hour("to_timestamp")).show()

+------------------+
|hour(to_timestamp)|
+------------------+
|                 4|
+------------------+

I don't want "4" as the hour, I want "6".

Any idea on how to solve this problem?

Be Chiller Too

1 Answer


From the code of TimestampType:

Internally, a timestamp is stored as the number of microseconds from the epoch of 1970-01-01T00:00:00.000000Z (UTC+00:00)

This means Spark does not store the original timezone of the timestamp; it only keeps the instant, normalised to UTC. When a timestamp is printed, it is formatted in the session timezone, which defaults to the default timezone of the JVM that Spark is running on.
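The same store-in-UTC, render-in-a-zone behaviour can be sketched in plain Python (no Spark needed) to see what is kept and what is merely a display choice:

```python
from datetime import datetime, timezone, timedelta

# Parse the original string, offset included
ts = datetime.fromisoformat("2020-12-31T06:53:21+02:00")

# Spark keeps only this instant, normalised to UTC:
print(ts.astimezone(timezone.utc))  # 2020-12-31 04:53:21+00:00

# At display time the instant is rendered in the session timezone,
# e.g. UTC+1 (Berlin in winter):
print(ts.astimezone(timezone(timedelta(hours=1))))  # 2020-12-31 05:53:21+01:00
```

The instant never changes; only the wall-clock representation does.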

You can set the timezone by setting spark.sql.session.timeZone:

print(spark.conf.get("spark.sql.session.timeZone"))

prints for me

Europe/Berlin

and thus

df_timestamp.show()

gives for me

+-------------------+                                                           
|       to_timestamp|
+-------------------+
|2020-12-31 05:53:21|
+-------------------+

because in December the timezone for Berlin is UTC+1.

If I change the timezone to UTC+2 I get a different result for the same dataframe:

spark.conf.set("spark.sql.session.timeZone", "UTC+2")
df_timestamp.show()

prints
+-------------------+
|       to_timestamp|
+-------------------+
|2020-12-31 06:53:21|
+-------------------+

The result of hour also depends on the configured timezone.

werner
  • Thanks for your answer, unfortunately it does not work in my notebook. I tried `spark.conf.set("spark.sql.session.timeZone", "France/Paris")` and 'UTC+2' but it does not change the output – Be Chiller Too Oct 13 '20 at 11:57
  • @BeChillerToo which Spark version are you using? I think the time zone configuration has been [introduced in 2.2](https://issues.apache.org/jira/browse/SPARK-18936). – werner Oct 13 '20 at 18:58