
I have timestamps in millisecond format and need to convert them from system time to UTC. However, when doing the transformation, Spark swallows the milliseconds and just shows them as zeros.

Short example:

from pyspark.sql import Row, SparkSession
from pyspark.sql.functions import to_timestamp, date_format

spark = SparkSession.builder.getOrCreate()

test = spark.createDataFrame([Row(timestamp = "2018-03-24 14:37:12,133")])
test_2 = test.withColumn('timestamp_2', to_timestamp('timestamp', 'yyyy-MM-dd HH:mm:ss,SSS'))
test_3 = test_2.withColumn('timestamp_3', date_format('timestamp_2', 'yyyy-MM-dd HH:mm:ss,SSS'))
test_3.write.option('header', True).csv('something')

This will result in:

timestamp,timestamp_2,timestamp_3
"2018-03-24 14:37:12,133",2018-03-24T14:37:12.000+01:00,"2018-03-24 14:37:12,000"

Can I somehow preserve the milliseconds?

I am using Python 3.6.4 and Spark 2.3.2.

guscht
  • Apparently that's just something `to_timestamp` does. I have no idea why they designed it that way, and neither the [Python docs](https://spark.apache.org/docs/2.3.2/api/python/pyspark.sql.html?highlight=dateformat#pyspark.sql.functions.to_timestamp) nor the [Scala docs](https://spark.apache.org/docs/2.3.2/api/scala/index.html#org.apache.spark.sql.functions$) make sense; the Python docs say it's converting to *DateType*, which makes even less sense than what you're seeing, and the Scala docs have weirdly broken English in `to_timestamp`'s description. – user2357112 Oct 12 '18 at 17:54
  • The Scala docs at least say "in seconds", but with no indication of why it would be in seconds when TimestampType has more resolution than that. They also say "Unix timestamp", but that doesn't make much sense either when the output format is TimestampType, not a Unix timestamp. – user2357112 Oct 12 '18 at 17:57
  • Okay. Do you by any chance know a not too ugly solution so that I can convert it to a timestamp? :) – guscht Oct 12 '18 at 18:02
  • https://stackoverflow.com/questions/44886772/how-to-convert-a-string-column-with-milliseconds-to-a-timestamp-with-millisecond could be useful. – user2357112 Oct 12 '18 at 18:05
  • What is your SQL backend? This must be a limitation of its timestamp type or something. – ivan_pozdeev Oct 12 '18 at 19:19
  • Spark does not utilize any SQL backend...to my understanding. – guscht Oct 12 '18 at 19:38
  • According to https://spark.apache.org/docs/latest/api/python/_modules/pyspark/sql/types.html#TimestampType , spark SQL types do wrap some underlying implementation. https://spark.apache.org/docs/latest/api/python/_modules/pyspark/sql/functions.html#to_timestamp suggests it's something from Java. I can't find the implementation for `sc._jvm.functions` though to say anything more. – ivan_pozdeev Oct 14 '18 at 11:42
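
For reference, the question linked a few comments above keeps the milliseconds entirely within Spark: parse the part before the comma with `unix_timestamp`, add the millisecond fraction back as a double, and cast the sum to a timestamp. A minimal sketch of that idea, assuming the same `yyyy-MM-dd HH:mm:ss,SSS`-style input (column names are only illustrative):

from pyspark.sql import Row, SparkSession
from pyspark.sql.functions import col, substring_index, unix_timestamp

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([Row(timestamp="2018-03-24 14:37:12,133")])

# Seconds since the epoch from the part before the comma,
# plus the milliseconds (as a fraction of a second) from the part after it.
df = df.withColumn(
    'epoch_with_millis',
    unix_timestamp(substring_index(col('timestamp'), ',', 1), 'yyyy-MM-dd HH:mm:ss')
    + substring_index(col('timestamp'), ',', -1).cast('double') / 1000.0
).withColumn('timestamp_ms', col('epoch_with_millis').cast('timestamp'))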

1 Answer


Managed to make it work now. Since Spark's `to_timestamp` does not seem to handle milliseconds correctly, I defined a UDF that uses the pytz and datetime packages to parse the string into a datetime, change the timezone, and format it back as a string.

import pytz
from datetime import datetime
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

def convert_to_utc(timestamp):
    # Parse the string (including milliseconds), attach the local timezone,
    # convert to UTC, and format it back to a string.
    local = pytz.timezone("Arctic/Longyearbyen")
    naive = datetime.strptime(timestamp, '%Y-%m-%d %H:%M:%S,%f')
    local_dt = local.localize(naive, is_dst=None)
    utc_dt = local_dt.astimezone(pytz.utc)
    # %f yields microseconds; drop the last three digits to keep only milliseconds.
    return utc_dt.strftime('%Y-%m-%d %H:%M:%S,%f')[:-3]

convert_to_utc_udf = udf(convert_to_utc, StringType())

test = spark.createDataFrame([Row(timestamp = "2018-03-24 14:37:12,133")])
test_2 = test.withColumn('timestamp_2', convert_to_utc_udf('timestamp'))
test_2.write.option('header', True).csv('something')

#Output:
#timestamp,timestamp_2
#"2018-03-24 14:37:12,133","2018-03-24 13:37:12,133"

Inspired by:

How to convert a string column with milliseconds to a timestamp with milliseconds in Spark 2.1 using Scala?

And:

How do I convert local time to UTC in Python?

guscht
  • I don't see how `"2018-03-24 14:37:12,133",2018-03-24T14:37:12.000+01:00,"2018-03-24 14:37:12,000"` is different from before. – ivan_pozdeev Oct 13 '18 at 17:42
  • Could you rather explain what you changed and what specifically made it work, and remove irrelevant parts of the code? – ivan_pozdeev Oct 13 '18 at 17:44
  • Hi Ivan, thank you for your feedback. I will remove the unnecessary parts in the answer. Notice however that the last column of the resulting table was the one that was different. `"2018-03-24 14:37:12,133",2018-03-24T14:37:12.000+01:00,"2018-03-24 14:37:12,000","2018-03-24 13:37:12,133"` – guscht Oct 13 '18 at 20:49
  • You aren't involving [`pyspark.sql.types.TimestampType`](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=udf#pyspark.sql.types.TimestampType) this way -- your resulting data only has strings. So the question proper remains unanswered. – ivan_pozdeev Oct 14 '18 at 11:16
  • That is correct. The goal was - as stated - that I want to convert the timestamps from local time to UTC time in spark and want to keep the milliseconds. – guscht Oct 14 '18 at 11:21