2

I'm trying to convert text_date which can contain timestamp in any possible format. Python library dateutil does this for me upto a great extent. If the parser fails, i want to assign None to the field. But i want to implement the same in my PySpark (Glue job) script. But I'm getting an issue where the job simply keeps executing without any errors infinitely.

Current implementation:

from dateutil import parser as dateutil_parser

# ...

dateutil_parser_udf = udf(lambda x: dateutil_parser.parse(x).strftime('%Y-%m-%d %H:%M:%S'), StringType())

# Clean Text Date
def parse_date(x):
    text_date = x
    if not text_date:
        return None
    parsed_date = None
    try:
        parsed_date = dateutil_parser_udf(x)
    except Exception:
        parsed_date = None
    return parsed_date
    
parse_dateUDF = udf(lambda x: parse_date(x), StringType())

mydf = (mydf.withColumn("text_date_cleaned", parse_dateUDF(col("text_date"))))

Input:

id,text_date,date_created,my_data
1,"30-Apr-21",2021-04-30 0:26:43,text1
2,"30Apr21",2021-04-30 7:38:16,text2
3,"30th April, 2021",2021-04-30 10:32:30,text3
4,"03042021",2021-04-30 13:14:06,text4
5,"30Apr21 xyz",2021-04-30 13:21:02,text5
6,"30/04/21",2021-04-30 15:41:24,text6
7,"4/30/2021 7:37:26",2021-04-30 16:00:49,text7
8,"30-04-2021",2021-04-30 16:09:38,text8
9,"2021-04-30 2:22:7 PM",2021-04-30 16:12:37,text9

Expected Output:

id,text_date,date_created,my_data
1,2021-04-30 00:00:00,2021-04-30 0:26:43,text1
2,2021-04-30 00:00:00,2021-04-30 7:38:16,text2
3,2021-04-30 00:00:00,2021-04-30 10:32:30,text3
4,2021-04-30 00:00:00,2021-04-30 13:14:06,text4
5,None,2021-04-30 13:21:02,text5
6,2021-04-30 00:00:00,2021-04-30 15:41:24,text6
7,2021-04-30 07:37:26,2021-04-30 16:00:49,text7
8,2021-04-30 00:00:00,2021-04-30 16:09:38,text8
9,2021-04-30 14:22:07,2021-04-30 16:12:37,text9

Please help me debug this script.

ProgramSpree
  • 372
  • 5
  • 21
  • outline what are your inputs and the expected output in a minimal reproducible way - https://www.stackoverflow.com/help/minimal-reproducible-example – Vaebhav Apr 25 '22 at 05:25
  • @Vaebhav have added input & output ... hope this helps! – ProgramSpree Apr 25 '22 at 11:46
  • Go Through this answer , using a udf is slower and there are ways you can implement the same behaviour in native pyspark - https://stackoverflow.com/questions/69187606/parse-date-format – Vaebhav Apr 25 '22 at 12:31
  • Additional Links - https://stackoverflow.com/questions/71043147/how-to-convert-a-string-column-to-date-column-for-a-pyspark-dataframe-using-strp/71044520#71044520 https://stackoverflow.com/questions/69360147/parse-different-formats-of-date-in-string-format-to-date-format-pyspark-when-cla/69360664#69360664 – Vaebhav Apr 25 '22 at 12:33
  • Thanks for sharing these links, @Vaebhav, but my problem is that I can't predict the format the timestamp is going to be received in - this is exemplified in the input shared above. Therefore, I need to implement the dateutil library. Any suggestions around this? – ProgramSpree Apr 26 '22 at 04:38
  • You should adopt the idea to tackle `N` number of formats , instead of relaying on date_util_parser , I blv if you analyse your dataset , you ll be able to identify a set of dates that are always consistently present and build your code accordingly – Vaebhav Apr 26 '22 at 04:43
  • But my dataset contains texts from various sources and the inherent nature of the dataset is that the formats are ever changing – ProgramSpree Apr 26 '22 at 04:44

0 Answers0