1

I have a dateframe containing timestamps (unix):

df = spark.createDataFrame(
    [
        (1527853209,),
        (1527852466,),
        (1527852178,),
        (1527851689,),
        (1527852214,),
    ],
    ["date_time"]
)

and I use the spark configuration below :

OPT_SPARK = {
    'master': 'yarn',
    'spark.executor.extraJavaOptions': '-Duser.timezone=GMT',
    'spark.driver.extraJavaOptions': '-Duser.timezone=GMT',
}

Without changing the spark options, I would like to convert my timestamp from "GMT" to "CET - Central Europe Time".

I tried below code :

from pyspark.sql import functions as F, types as T
from datetime import datetime 
from dateutil import tz 

def conv(in_ts): 

    from_zone = tz.gettz('GMT') 
    to_zone = tz.gettz('CET') 

    utc = datetime.utcfromtimestamp(in_ts) 
    utc = utc.replace(tzinfo=from_zone)
    n_ts = utc.astimezone(to_zone).replace(tzinfo=None)

    return n_ts


conv_udf = F.udf(conv, T.TimestampType()) 

It works fine when I test the function, but not in spark :

# 1527853209 is GMT: Friday 1 June 2018 11:40:09
conv(1527853209)
datetime.datetime(2018, 6, 1, 13, 40, 9)

df.select( 
    "date_time",
    F.col("date_time").cast("timestamp"), 
    conv_udf("date_time")
).show() 

+----------+-------------------+-------------------+
| date_time|          date_time|    conv(date_time)|
+----------+-------------------+-------------------+
|1527853209|2018-06-01 11:40:09|2018-06-01 11:40:09|
|1527852466|2018-06-01 11:27:46|2018-06-01 11:27:46|
|1527852178|2018-06-01 11:22:58|2018-06-01 11:22:58|
|1527851689|2018-06-01 11:14:49|2018-06-01 11:14:49|
|1527852214|2018-06-01 11:23:34|2018-06-01 11:23:34|
+----------+-------------------+-------------------+

I could not find any builtin function to achieve that, so using a UDF seemed to be the best solution but, apparently, it is not working as expected.

Expected result :

+----------+-------------------+-------------------+
| date_time|          date_time|    conv(date_time)|
+----------+-------------------+-------------------+
|1527853209|2018-06-01 11:40:09|2018-06-01 13:40:09|
|1527852466|2018-06-01 11:27:46|2018-06-01 13:27:46|
|1527852178|2018-06-01 11:22:58|2018-06-01 13:22:58|
|1527851689|2018-06-01 11:14:49|2018-06-01 13:14:49|
|1527852214|2018-06-01 11:23:34|2018-06-01 13:23:34|
+----------+-------------------+-------------------+
Steven
  • 14,048
  • 6
  • 38
  • 73
  • 1
    Timestamps should be zone independant. See [this](https://stackoverflow.com/questions/23062515/do-unix-timestamps-change-across-timezones). – Toni Sredanović May 29 '19 at 09:47
  • I know, I still want to have the final result as CET time. – Steven May 29 '19 at 09:56
  • Don't use the string `'CET'`. Choose [a canonical time zone that *observes* CET](https://en.wikipedia.org/wiki/List_of_tz_database_time_zones) such as `'Europe/Paris'`. – Matt Johnson-Pint May 29 '19 at 16:46

1 Answers1

1

I think what you are doing is right, but when you print out the result using show() your conv(date_time) is getting traduced to your timezone (GMT).

If the only thing you need is to show the datetime in the new timezone (CET), you can rewrite the udf as a StringType:

def conv(in_ts):

from_zone = tz.gettz('GMT')
to_zone = tz.gettz('CET')

utc = datetime.utcfromtimestamp(in_ts)
utc = utc.replace(tzinfo=from_zone)
n_ts = utc.astimezone(to_zone)

return n_ts.strftime('%x %X')


conv_udf = F.udf(conv, T.StringType())
Carlos Vilchez
  • 2,774
  • 28
  • 30