1

I have a data frame with date/time column storing data as binary data type. I need to convert it the actual datetime data type so I can run SQL window functions and more. Hence, looking for some working examples.

Input dataframe schema: root |-- ce_time: binary (nullable = true)

Sample data:

 +-------------------------------------------------------------------------------------+
 |ce_time                                                                              | 
 +-------------------------------------------------------------------------------------+
 |[32 30 32 32 2D 30 35 2D 30 32 54 30 30 3A 30 34 3A 33 39 2E 32 30 34 36 37 38 35 5A]|
 |[32 30 32 32 2D 30 35 2D 30 32 54 30 30 3A 30 34 3A 34 36 2E 38 32 33 32 34 32 5A]   |
 |[32 30 32 32 2D 30 35 2D 30 32 54 30 30 3A 30 34 3A 35 34 2E 34 35 39 30 34 33 37 5A]|
 |[32 30 32 32 2D 30 35 2D 30 32 54 30 30 3A 30 35 3A 30 32 2E 35 37 30 38 35 39 36 5A]|
 +-------------------------------------------------------------------------------------+

I can convert the above to string, it looks like this, but I need it in date/time type, not string.

 +-------------------------------------------------------------------------------------+----------------------------+
 |ce_time                                                                              |ce_time_string              |
 +-------------------------------------------------------------------------------------+----------------------------+
 |[32 30 32 32 2D 30 35 2D 30 32 54 30 30 3A 30 34 3A 33 39 2E 32 30 34 36 37 38 35 5A]|2022-05-02T00:04:39.2046785Z|
 |[32 30 32 32 2D 30 35 2D 30 32 54 30 30 3A 30 34 3A 34 36 2E 38 32 33 32 34 32 5A]   |2022-05-02T00:04:46.823242Z |
 |[32 30 32 32 2D 30 35 2D 30 32 54 30 30 3A 30 34 3A 35 34 2E 34 35 39 30 34 33 37 5A]|2022-05-02T00:04:54.4590437Z|
 |[32 30 32 32 2D 30 35 2D 30 32 54 30 30 3A 30 35 3A 30 32 2E 35 37 30 38 35 39 36 5A]|2022-05-02T00:05:02.5708596Z|
 +-------------------------------------------------------------------------------------+----------------------------+

If someone knows how to convert binary to date/time in PySpark and keep the above date/time values exactly the same, please share!

Much appreciated!!

bda
  • 372
  • 1
  • 7
  • 22
  • have you tried to parse the new string column to timestamp type like the answer [here](https://stackoverflow.com/a/45558713/9477843)? – AdibP May 04 '22 at 03:48

1 Answers1

1

You can first convert your binary value to string value using decode built-in function and then convert this string value to timestamp value using cast column method

Note: timestamp is spark equivalent of python's datetime type

Here is the complete code:

from pyspark.sql import functions as F

result = df.withColumn('ce_time', F.decode('ce_time', 'UTF-8').cast('timestamp'))

With the following df input dataframe:

+-------------------------------------------------------------------------------------+
|ce_time                                                                              |
+-------------------------------------------------------------------------------------+
|[32 30 32 32 2D 30 35 2D 30 32 54 30 30 3A 30 34 3A 33 39 2E 32 30 34 36 37 38 35 5A]|
|[32 30 32 32 2D 30 35 2D 30 32 54 30 30 3A 30 34 3A 34 36 2E 38 32 33 32 34 32 5A]   |
|[32 30 32 32 2D 30 35 2D 30 32 54 30 30 3A 30 34 3A 35 34 2E 34 35 39 30 34 33 37 5A]|
|[32 30 32 32 2D 30 35 2D 30 32 54 30 30 3A 30 35 3A 30 32 2E 35 37 30 38 35 39 36 5A]|
+-------------------------------------------------------------------------------------+

You get the following result dataframe:

+--------------------------+
|ce_time                   |
+--------------------------+
|2022-05-02 02:04:39.204678|
|2022-05-02 02:04:46.823242|
|2022-05-02 02:04:54.459043|
|2022-05-02 02:05:02.570859|
+--------------------------+

having the following schema:

root
 |-- ce_time: timestamp (nullable = true)
Vincent Doba
  • 4,343
  • 3
  • 22
  • 42