
I would like to convert a timestamp column into a specific date format.

Here is my input:

+----------+
| timestamp|
+----------+
|1532383202|
+----------+

What I would expect:

+------------------+
|      date        |
+------------------+
|24/7/2018 1:00:00 |
+------------------+

If possible, I would also like to set the minutes and seconds to 0, even when they are not 0.

For example, if I have this:

+------------------+
|      date        |
+------------------+
|24/7/2018 1:06:32 |
+------------------+

I would like this:

+------------------+
|      date        |
+------------------+
|24/7/2018 1:00:00 |
+------------------+

What I tried:

from pyspark.sql.functions import unix_timestamp, date_format

table = table.withColumn(
    'timestamp',
    unix_timestamp(date_format('timestamp', 'yyyy-MM-dd HH:MM:SS'))
)

But I get NULL.

pault
anthonya
  • In almost all cases, you should [avoid using a `udf` when there is an equivalent API function](https://stackoverflow.com/questions/38296609/spark-functions-vs-udf-performance). Depending on the size of your data, it may not make a noticeable difference but it's *much* more efficient to allow all of the processing happen inside the JVM. – pault Aug 24 '18 at 17:30

2 Answers


Update

Inspired by @Tony Pellerin's answer, I realize you can go directly to the :00:00 without having to use regexp_replace():

table = table.withColumn("date", f.from_unixtime("timestamp", "dd/MM/yyyy HH:00:00"))
table.show()
#+----------+-------------------+
#| timestamp|               date|
#+----------+-------------------+
#|1532383202|23/07/2018 18:00:00|
#+----------+-------------------+
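Note that from_unixtime() renders the string in the session's local timezone, which is why the hour above differs from the 24/7/2018 1:00:00 expected in the question. If needed, the session timezone can be pinned explicitly (the GMT+3 value below is only an assumption inferred from the expected output):

# Assumption: the expected output implies a UTC+3 local zone;
# substitute any valid zone ID for your environment
spark.conf.set("spark.sql.session.timeZone", "GMT+3")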

Your code doesn't work because pyspark.sql.functions.unix_timestamp() will:

Convert time string with given pattern (‘yyyy-MM-dd HH:mm:ss’, by default) to Unix time stamp (in seconds), using the default timezone and the default locale, return null if fail.
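To see this, here is a minimal sketch (the one-row DataFrame is hypothetical) showing the null-on-failure behavior:

from pyspark.sql import SparkSession
import pyspark.sql.functions as f

spark = SparkSession.builder.getOrCreate()

# Hypothetical one-row DataFrame holding the epoch value as a string
df = spark.createDataFrame([("1532383202",)], ["ts_string"])

# "1532383202" does not match the default pattern 'yyyy-MM-dd HH:mm:ss',
# so unix_timestamp() returns null instead of raising an error
df.select(f.unix_timestamp("ts_string").alias("parsed")).show()
#+------+
#|parsed|
#+------+
#|  null|
#+------+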

You actually want to do the inverse of this operation, which is convert from an integer timestamp to a string. For this you can use pyspark.sql.functions.from_unixtime():

import pyspark.sql.functions as f

table = table.withColumn("date", f.from_unixtime("timestamp", "dd/MM/yyyy HH:mm:ss"))
table.show()
#+----------+-------------------+
#| timestamp|               date|
#+----------+-------------------+
#|1532383202|23/07/2018 18:00:02|
#+----------+-------------------+

Note that the format pattern is case sensitive: mm is minutes and ss is seconds, whereas the HH:MM:SS in your attempt would put the month in the minutes slot.

Now the date column is a string:

table.printSchema()
#root
# |-- timestamp: long (nullable = true)
# |-- date: string (nullable = true)

So you can use pyspark.sql.functions.regexp_replace() to make the minutes and seconds zero:

table.withColumn("date", f.regexp_replace("date", ":\d{2}:\d{2}", ":00:00")).show()
#+----------+-------------------+
#| timestamp|               date|
#+----------+-------------------+
#|1532383202|23/07/2018 18:00:00|
#+----------+-------------------+

The regex pattern ":\d{2}:\d{2}" matches a literal : followed by exactly 2 digits, twice (once for the minutes and once for the seconds).
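As a quick sanity check outside Spark, the same substitution in plain Python (with a sample string) behaves the same way:

import re

# Same pattern applied to a sample formatted string
print(re.sub(r":\d{2}:\d{2}", ":00:00", "23/07/2018 18:00:02"))
# 23/07/2018 18:00:00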

pault

Maybe you could use the datetime library to convert timestamps to your desired format, wrapped in a user-defined function (UDF) so it can be applied to a Spark DataFrame column. Here's what I would do:

# Import the libraries
from pyspark.sql.functions import udf
from datetime import datetime

# Create a function that returns the desired string from a timestamp 
def format_timestamp(ts):
    return datetime.fromtimestamp(ts).strftime('%Y-%m-%d %H:00:00')

# Create the UDF (no lambda wrapper needed)
format_timestamp_udf = udf(format_timestamp)

# Finally, apply the function to each element of the 'timestamp' column
table = table.withColumn('timestamp', format_timestamp_udf(table['timestamp']))
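For reference, here is a hedged usage sketch (the one-row DataFrame is hypothetical, and the rendered hour depends on the driver's local timezone):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
table = spark.createDataFrame([(1532383202,)], ["timestamp"])
table = table.withColumn('timestamp', format_timestamp_udf(table['timestamp']))
table.show(truncate=False)
# For example, on a machine in the US/Eastern timezone:
#+-------------------+
#|timestamp          |
#+-------------------+
#|2018-07-23 18:00:00|
#+-------------------+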

Hope this helps.

pault
Tony Pellerin
    Good edit there. +1 for showing that you can directly convert to `:00:00` without having to use regex. I've added it to my answer as well. Your answer works but it's preferable to [avoid using `udf` when possible](https://stackoverflow.com/questions/38296609/spark-functions-vs-udf-performance). – pault Aug 23 '18 at 15:18