I'm trying to convert text_date
which can contain timestamp in any possible format. Python library dateutil does this for me upto a great extent.
If the parser fails, i want to assign None to the field.
But i want to implement the same in my PySpark (Glue job) script.
But I'm getting an issue where the job simply keeps executing without any errors infinitely.
Current implementation:
from dateutil import parser as dateutil_parser
# ...
dateutil_parser_udf = udf(lambda x: dateutil_parser.parse(x).strftime('%Y-%m-%d %H:%M:%S'), StringType())
# Clean Text Date
def parse_date(x):
text_date = x
if not text_date:
return None
parsed_date = None
try:
parsed_date = dateutil_parser_udf(x)
except Exception:
parsed_date = None
return parsed_date
parse_dateUDF = udf(lambda x: parse_date(x), StringType())
mydf = (mydf.withColumn("text_date_cleaned", parse_dateUDF(col("text_date"))))
Input:
id,text_date,date_created,my_data
1,"30-Apr-21",2021-04-30 0:26:43,text1
2,"30Apr21",2021-04-30 7:38:16,text2
3,"30th April, 2021",2021-04-30 10:32:30,text3
4,"03042021",2021-04-30 13:14:06,text4
5,"30Apr21 xyz",2021-04-30 13:21:02,text5
6,"30/04/21",2021-04-30 15:41:24,text6
7,"4/30/2021 7:37:26",2021-04-30 16:00:49,text7
8,"30-04-2021",2021-04-30 16:09:38,text8
9,"2021-04-30 2:22:7 PM",2021-04-30 16:12:37,text9
Expected Output:
id,text_date,date_created,my_data
1,2021-04-30 00:00:00,2021-04-30 0:26:43,text1
2,2021-04-30 00:00:00,2021-04-30 7:38:16,text2
3,2021-04-30 00:00:00,2021-04-30 10:32:30,text3
4,2021-04-30 00:00:00,2021-04-30 13:14:06,text4
5,None,2021-04-30 13:21:02,text5
6,2021-04-30 00:00:00,2021-04-30 15:41:24,text6
7,2021-04-30 07:37:26,2021-04-30 16:00:49,text7
8,2021-04-30 00:00:00,2021-04-30 16:09:38,text8
9,2021-04-30 14:22:07,2021-04-30 16:12:37,text9
Please help me debug this script.