I want to convert a string in the format [MMM dd yyyy hh:mm AM/PM] to a date in PySpark.

Input data:

+-------------------+
|         event_date|
+-------------------+
|Jul  2 2020  5:45PM|
|Jul  3 2020 12:02PM|
|Oct 20 2020 12:07AM|
|Jul  2 2020  6:23PM|
+-------------------+

Expected Output:

+----------+
|event_date|
+----------+
|07-02-2020|
|07-03-2020|
|10-20-2020|
|07-02-2020|
+----------+
MC Emperor

2 Answers

For newer versions of Spark, use this:

from pyspark.sql import functions as F

a1 = spark.createDataFrame([('Jul 2 2020 5:45AM',), ('Jul 3 2020 12:02PM',), ('Oct 20 2020 12:07AM',), ('Jul 2 2020 6:23PM',)], ['event_date'])
a1.withColumn("event_date_2", F.to_date('event_date', "MMM d yyyy h:mma")).show()

Output as below:

a1:pyspark.sql.dataframe.DataFrame = [event_date: string]
+-------------------+------------+
|         event_date|event_date_2|
+-------------------+------------+
|  Jul 2 2020 5:45AM|  2020-07-02|
| Jul 3 2020 12:02PM|  2020-07-03|
|Oct 20 2020 12:07AM|  2020-10-20|
|  Jul 2 2020 6:23PM|  2020-07-02|
+-------------------+------------+
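
The expected output in the question uses MM-dd-yyyy, while to_date yields a DateType column that show() prints as yyyy-MM-dd. A minimal sketch of bridging that gap, reusing the a1 DataFrame above: wrap the parsed date in date_format, which returns a formatted string column.

a1.withColumn("event_date_2", F.date_format(F.to_date('event_date', "MMM d yyyy h:mma"), "MM-dd-yyyy")).show()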
prashanth

Your event_date is of the format MMM d yyyy hh:mmaa.
If you want to retain the time along with the date, then:

from pyspark.sql import functions as F

df.withColumn("event_date_2", F.from_unixtime(F.unix_timestamp("event_date",\
        'MMM d yyyy hh:mmaa'),'MM-dd-yyyy HH:mm:ss')).show()

+-------------------+-------------------+
|         event_date|       event_date_2|
+-------------------+-------------------+
| Jul 2 2020  5:45PM|07-02-2020 17:45:00|
|Oct 20 2020 12:07AM|10-20-2020 00:07:00|
| Jul 3 2020 12:02PM|07-03-2020 12:02:00|
| Jul 2 2020  6:23PM|07-02-2020 18:23:00|
+-------------------+-------------------+

Or if you just want to convert it to date, then use the to_date function.

from pyspark.sql.functions import col

df.withColumn("event_date_2", F.to_date(col('event_date'), "MMM d yyyy hh:mmaa")).show()

+-------------------+------------+
|         event_date|event_date_2|
+-------------------+------------+
| Jul 2 2020  5:45PM|  07-02-2020|
|Oct 20 2020 12:07AM|  10-20-2020|
| Jul 3 2020 12:02PM|  07-03-2020|
| Jul 2 2020  6:23PM|  07-02-2020|
+-------------------+------------+
Cena
  • Hi @Cena, thank you for helping me convert date formats. But I am getting the below error when I applied the code. Py4JJavaError: An error occurred while calling o742.showString. : org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'MMM d yyyy hh:mmaa' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html – Praveen Bandi Nov 12 '20 at 16:20
  • Could it be because there are three spaces between the month and day for single-digit dates, but only two spaces for double-digit dates? – Praveen Bandi Nov 12 '20 at 16:22
  • The spaces don't matter. The code would work even if there were 5 spaces in-between. I think your problem is with legacy date formats being used in newer versions of Spark (3.0 and above). Set the spark configurations given in [this](https://stackoverflow.com/questions/62602720/string-to-date-migration-from-spark-2-0-to-3-0-gives-fail-to-recognize-eee-mmm/62607317#62607317) answer; a minimal sketch of that setting follows after these comments. – Cena Nov 12 '20 at 17:38
  • Great advice, and it solved my issue. Thank you from the bottom of my heart @Cena – Praveen Bandi Nov 12 '20 at 18:00
  • Glad to help! @PraveenBandi – Cena Nov 12 '20 at 18:48
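
Following up on the comments, a minimal sketch of the LEGACY parser workaround, assuming an active SparkSession named spark and the same df as above:

# Restore pre-3.0 parsing behaviour so patterns like 'MMM d yyyy hh:mmaa' are accepted again
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
df.withColumn("event_date_2", F.to_date(col('event_date'), "MMM d yyyy hh:mmaa")).show()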