get local time in pyspark dependent on a column

Question

In pyspark, one can get the local time from the UTC time by passing the timestamp and the timezone to the function from_utc_timestamp

>>> df = spark.createDataFrame([('1997-02-28 10:30:00',)], ['t'])
>>> df.select(from_utc_timestamp(df.t, "PST").alias('t')).collect()
[Row(t=datetime.datetime(1997, 2, 28, 2, 30))]

The timezone here is provided as a string literal ("PST"). If one were to have the following data structure:

+--------------------------+---------+
| utc_time                 |timezone |
+--------------------------+---------+
|  2018-08-03T23:27:30.000Z|  PST    |
|  2018-08-03T23:27:30.000Z|  GMT    |
|  2018-08-03T23:27:30.000Z|  SGT    |
+--------------------------+---------+

How could one achieve the following new column (preferably without a UDF)?

+--------------------------+-----------------------------------+
| utc_time                 |timezone | local_time              |
+--------------------------+-----------------------------------+
|  2018-08-03T23:27:30.000Z|  PST    | 2018-08-03T15:27:30.000 |
|  2018-08-03T23:27:30.000Z|  GMT    | 2018-08-04T00:27:30.000 |
|  2018-08-03T23:27:30.000Z|  SGT    | 2018-08-04T07:27:30.000 |
+--------------------------+-----------------------------------+

You can use the method on [this post](https://stackoverflow.com/questions/51140470/using-a-column-value-as-a-parameter-to-a-spark-dataframe-function) but `from_utc_timestamp` doesn't seem to like `GMT` or `SGT` (returning timestamp column unchanged). — pault, Aug 22 '18 at 22:25
Looks like the use of 3 letter timezones are deprecated. thanks @pault — jka.ne, Aug 23 '18 at 08:28

score 2 · Accepted Answer · edited Aug 23 '18 at 14:41

Using pyspark.sql.functions.expr() rather the the dataframe API, this can be achieved by:

import pyspark.sql.functions as F

df = df.select(
    '*',
    F.expr('from_utc_timestamp(utc_time, timezone)').alias("timestamp_local")
)

However, the use of 3 letter timezones is deprecated. Per the Java docs:

For compatibility with JDK 1.1.x, some other three-letter time zone IDs (such as "PST", "CTT", "AST") are also supported. However, their use is deprecated because the same abbreviation is often used for multiple time zones (for example, "CST" could be U.S. "Central Standard Time" and "China Standard Time"), and the Java platform can then only recognize one of them.

get local time in pyspark dependent on a column

1 Answers1