
I have a Spark DataFrame with a column "requestTime", which is a string representation of a timestamp. How can I convert it to the format yyyy-MM-dd HH:mm:ss, given the value 20171107014824952 (which means 2017-11-07 01:48:25)?

The last five digits encode the seconds: in the example above the seconds part is 24952, but what was displayed in the log file is 25, so I have to round 24.952 up before applying the to_timestamp function. That's why I'm asking for help.
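For reference, here is a minimal DataFrame reproducing the sample value (a sketch assuming a live SparkSession named spark; the variable name df matches the snippets below):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# One-row DataFrame holding the sample timestamp as a string
df = spark.createDataFrame([("20171107014824952",)], ["requestTime"])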

    Possible duplicate of [Convert pyspark string to date format](https://stackoverflow.com/questions/38080748/convert-pyspark-string-to-date-format). – pault Jul 16 '19 at 16:42
  • No, it's not the same question; here I have a very different time format. Thanks anyway @pault – DS_Tn Jul 17 '19 at 07:58
  • If you read the answer on the linked duplicate you will see that the format can be specified. – pault Jul 17 '19 at 17:05
  • If you check the format of my column, you will see that the seconds part is formed of 5 digits. In the example above the seconds part is 24952, while what was displayed in the log file is 25, so I have to round up 24.952 before applying the "to_timestamp" function. That's why I asked for help. – DS_Tn Jul 18 '19 at 08:37
  • In the future, it's helpful to add the details of what you've tried and why it's not working in the question. I have edited those into the question for you. Please read [ask] and [how to create good reproducible spark dataframe examples](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-examples) – pault Jul 18 '19 at 14:11

1 Answer


Assuming you have the following Spark DataFrame:

df.show()
#+-----------------+
#|      requestTime|
#+-----------------+
#|20171107014824952|
#+-----------------+

With the schema:

df.printSchema()
#root
# |-- requestTime: string (nullable = true)

You can use the techniques described in [Convert pyspark string to date format](https://stackoverflow.com/questions/38080748/convert-pyspark-string-to-date-format) to convert this to a timestamp. Since the solution depends on your Spark version, I've created the following helper function:

import pyspark.sql.functions as f

def timestamp_from_string(date_str, fmt):
    """Parse a string column into a timestamp column, across Spark versions."""
    try:
        # For Spark 2.2 and above, to_timestamp is available
        return f.to_timestamp(date_str, fmt)
    except (TypeError, AttributeError):
        # For Spark 2.1 and below, you'll have to do it this way
        return f.from_unixtime(f.unix_timestamp(date_str, fmt))
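An equivalent alternative to the try/except feature detection would be an explicit check of the Spark version string; a minimal sketch, assuming a live SparkSession named spark:

# Hypothetical explicit version check instead of try/except
major, minor = (int(x) for x in spark.version.split(".")[:2])
has_to_timestamp = (major, minor) >= (2, 2)  # True where f.to_timestamp exists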

Now call it on your data using the appropriate format (note HH, the 24-hour-clock pattern, rather than hh, which is the 12-hour clock):

df.withColumn(
    "requestTime",
    timestamp_from_string(f.col("requestTime"), "yyyyMMddHHmmssSSS")
).show()
#+-------------------+
#|        requestTime|
#+-------------------+
#|2017-11-07 01:48:24|
#+-------------------+

Unfortunately, this truncates the timestamp instead of rounding.

Therefore, you need to do the rounding yourself before converting. The tricky part is that the number is stored as a string: you have to cast it to a double, divide by 1000, round, cast back to a long (to chop off the decimal; you can't use int because the number is too big for a 32-bit integer), and finally back to a string.

df.withColumn(
    "requestTime",
    timestamp_from_string(
        f.round(f.col("requestTime").cast("double")/1000.0).cast("long").cast("string"),
        "yyyyMMddHHmmss"
    )
).show()
#+-------------------+
#|        requestTime|
#+-------------------+
#|2017-11-07 01:48:25|
#+-------------------+
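
To verify the intermediate rounding step on its own, you can inspect the rounded string before it is parsed (the alias rounded is just an illustrative name):

df.select(
    f.round(f.col("requestTime").cast("double")/1000.0).cast("long").cast("string").alias("rounded")
).show()
#+--------------+
#|       rounded|
#+--------------+
#|20171107014825|
#+--------------+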