I have the following DataFrame
containing the date format
- yyyyMMddTHH:mm:ss+UTC
Data Preparation
sparkDF = sql.createDataFrame([("20201021T00:00:00+0530",),
("20211011T00:00:00+0530",),
("20200212T00:00:00+0300",),
("20211021T00:00:00+0530",),
("20211021T00:00:00+0900",),
("20211021T00:00:00-0500",)
]
,['timestamp'])
sparkDF.show(truncate=False)
+----------------------+
|timestamp |
+----------------------+
|20201021T00:00:00+0530|
|20211011T00:00:00+0530|
|20200212T00:00:00+0300|
|20211021T00:00:00+0530|
|20211021T00:00:00+0900|
|20211021T00:00:00-0500|
+----------------------+
I m aware of the date format
to parse and convert the values to DateType
Timestamp Parsed
sparkDF.select(F.to_date(F.col('timestamp'),"yyyyMMdd'T'HH:mm:ss+0530").alias('timestamp_parsed')).show()
+----------------+
|timestamp_parsed|
+----------------+
| 2020-10-21|
| 2021-10-11|
| null|
| 2021-10-21|
| null|
| null|
+----------------+
As you can see , its specific to +0530
strings , I m aware of the fact that I can use multiple patterns and coalesce
the first non-null values
Multiple Patterns & Coalesce
sparkDF.withColumn('p1',F.to_date(F.col('timestamp'),"yyyyMMdd'T'HH:mm:ss+0530"))\
.withColumn('p2',F.to_date(F.col('timestamp'),"yyyyMMdd'T'HH:mm:ss+0900"))\
.withColumn('p3',F.to_date(F.col('timestamp'),"yyyyMMdd'T'HH:mm:ss-0500"))\
.withColumn('p4',F.to_date(F.col('timestamp'),"yyyyMMdd'T'HH:mm:ss+0300"))\
.withColumn('timestamp_parsed',F.coalesce(F.col('p1'),F.col('p2'),F.col('p3'),F.col('p4')))\
.drop(*['p1','p2','p3','p4'])\
.show(truncate=False)
+----------------------+----------------+
|timestamp |timestamp_parsed|
+----------------------+----------------+
|20201021T00:00:00+0530|2020-10-21 |
|20211011T00:00:00+0530|2021-10-11 |
|20200212T00:00:00+0300|2020-02-12 |
|20211021T00:00:00+0530|2021-10-21 |
|20211021T00:00:00+0900|2021-10-21 |
|20211021T00:00:00-0500|2021-10-21 |
+----------------------+----------------+
Is there a better way to accomplish this, as there might be a bunch of other UTC
within the data source, is there a standard UTC TZ
available within Spark to parse all the cases