
I'm trying to convert a string column to a timestamp column. The data is in this format:

c1                       c2
2019-12-10 10:07:54.000  2019-12-13 10:07:54.000
2020-06-08 15:14:49.000  2020-06-18 10:07:54.000
from pyspark.sql.functions import col, udf, to_timestamp

joined_df.select(to_timestamp(joined_df.c1, '%Y-%m-%d %H:%M:%S.%SSSS').alias('dt')).collect()
joined_df.select(to_timestamp(joined_df.c2, '%Y-%m-%d %H:%M:%S.%SSSS').alias('dt')).collect()

Once the columns are converted, I want a new date-difference column computed as c2 - c1.

In plain Python (pandas) I'm doing it like this:

df['c1'] = df['c1'].fillna('0000-01-01').apply(lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S.%f'))
df['c2'] = df['c2'].fillna('0000-01-01').apply(lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S.%f'))
df['days'] = (df['c2'] - df['c1']).apply(lambda x: x.days)

Can anyone help me convert this to PySpark?

mck

1 Answer


If you want to get the date difference, you can use datediff:

import pyspark.sql.functions as F

# Cast the string columns to timestamp; the default 'yyyy-MM-dd HH:mm:ss.SSS'
# layout in your data is parsed by a plain cast, no format string needed.
df = df.withColumn('c1', F.col('c1').cast('timestamp')) \
       .withColumn('c2', F.col('c2').cast('timestamp'))
# datediff compares only the date parts; the time of day is ignored.
result = df.withColumn('days', F.datediff(F.col('c2'), F.col('c1')))
result.show(truncate=False)
+-----------------------+-----------------------+----+
|c1                     |c2                     |days|
+-----------------------+-----------------------+----+
|2019-12-10 10:07:54.000|2019-12-13 10:07:54.000|3   |
|2020-06-08 15:14:49.000|2020-06-18 10:07:54.000|10  |
+-----------------------+-----------------------+----+
mck