I am trying to parse dates in the Paris timezone (written with a +02:00 UTC offset), and PySpark drops the offset when converting from string to timestamp:
from pyspark.sql import functions as F
df_times = spark.createDataFrame([('2020-12-31T06:53:21.000+02:00',)], ["t"])
# df_times:pyspark.sql.dataframe.DataFrame
# t:string
df_timestamp = df_times.select(F.to_timestamp(F.col("t")).alias('to_timestamp'))
# df_timestamp:pyspark.sql.dataframe.DataFrame
# to_timestamp:timestamp
df_timestamp.show()
+-------------------+
|       to_timestamp|
+-------------------+
|2020-12-31 04:53:21|
+-------------------+
Why does PySpark display 2020-12-31 04:53:21 instead of 2020-12-31 06:53:21+02:00?
It's especially frustrating when I try to retrieve the hour:
df_timestamp.select(F.hour("to_timestamp")).show()
+------------------+
|hour(to_timestamp)|
+------------------+
|                 4|
+------------------+
I don't want the hour to come back as 4; I want 6, the hour as written in the original string.
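To make the expected result concrete, here is a small plain-Python sketch (no Spark involved) of the behaviour I am after. The variable names are only for illustration. It also shows that 4 is the hour of the same instant in UTC, so my guess (an assumption, I have not checked the setting) is that the Spark session time zone is UTC and the value is being rendered in that zone:

```python
from datetime import datetime, timezone

raw = "2020-12-31T06:53:21.000+02:00"

# Plain Python keeps the +02:00 offset attached to the parsed value,
# so the hour is the one written in the string.
parsed = datetime.fromisoformat(raw)
local_hour = parsed.hour

# The same instant expressed in UTC. This matches the hour Spark returns,
# which is why I suspect (assumption) a UTC session time zone.
utc_hour = parsed.astimezone(timezone.utc).hour

print(local_hour, utc_hour)
```

What I want from PySpark is the first value (`local_hour`), not the second.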
Any idea on how to solve this problem?