
I'm currently working in PySpark and I have a CSV file (with several columns, of which I'll show only the date columns) that looks like this when opened in Excel:

Date received   Date sent to company
11/13/2014  11/13/2014
11/13/2014  11/13/2014
11/13/2014  11/13/2014
11/13/2014  11/13/2014
12-11-2014  11/13/2014
12-11-2014  11/13/2014
12-11-2014  11/13/2014
12-11-2014  11-12-2014
12-11-2014  11-12-2014
12-11-2014  11-12-2014
12-11-2014  11-12-2014
12-11-2014  11-12-2014
12-11-2014  11-12-2014
12-11-2014  11-12-2014
12-11-2014  11-12-2014
12-11-2014  11-12-2014
12-11-2014  11-12-2014
12-11-2014  11-12-2014


I'm loading this CSV file into PySpark, and I want the date columns in one particular format, say "dd-mm-yyyy".

Can somebody help me with it?

I've tried:

from pyspark.sql.functions import col, to_date

df.select(col("Date_received"), to_date(col("Date_received"), "dd-MM-yyyy").alias("date")) \
  .show()

which gives the following output:

+-------------+----------+
|Date_received|      date|
+-------------+----------+
|   11/13/2014|      null|
|   11/13/2014|      null|
|   11/13/2014|      null|
|   11/13/2014|      null|
|   12-11-2014|2014-11-12|
|   12-11-2014|2014-11-12|
|   12-11-2014|2014-11-12|
|   12-11-2014|2014-11-12|
|   12-11-2014|2014-11-12|
|   12-11-2014|2014-11-12|
|   12-11-2014|2014-11-12|
|   12-11-2014|2014-11-12|
|   12-11-2014|2014-11-12|
|   12-11-2014|2014-11-12|
|   12-11-2014|2014-11-12|
|   12-11-2014|2014-11-12|
|   12-11-2014|2014-11-12|
|   12-11-2014|2014-11-12|
|   12-11-2014|2014-11-12|
|   12-11-2014|2014-11-12|
+-------------+----------+
only showing top 20 rows

Observe that the output for the first 4 rows is null. Also, I passed "dd-mm-yyyy", so why is the output in "yyyy-mm-dd" format?

How do I tackle this problem? I want to change the date format here (to "dd-mm-yyyy").

Pratik Bawane

1 Answer


To handle multiple date formats within your data, parse each format into its own column using to_date, then coalesce the results to take the first non-null value.

You can find more on this - Parse Date Format

Available Date Parsing Formats within Spark - https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html

A typical example is as below -

Data Preparation

import pandas as pd
from io import StringIO

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = pd.read_csv(StringIO("""
Date received,Date sent to company
11/13/2014,11/13/2014
11/13/2014,11/13/2014
11/13/2014,11/13/2014
11/13/2014,11/13/2014
12-11-2014,11/13/2014
12-11-2014,11/13/2014
12-11-2014,11/13/2014
12-11-2014,11-12-2014
12-11-2014,11-12-2014
12-11-2014,11-12-2014
12-11-2014,11-12-2014
12-11-2014,11-12-2014
12-11-2014,11-12-2014
12-11-2014,11-12-2014
12-11-2014,11-12-2014
12-11-2014,11-12-2014
12-11-2014,11-12-2014
12-11-2014,11-12-2014
"""), delimiter=",")


sparkDF = spark.createDataFrame(df)

sparkDF.show()

+-------------+--------------------+
|Date received|Date sent to company|
+-------------+--------------------+
|   11/13/2014|          11/13/2014|
|   11/13/2014|          11/13/2014|
|   11/13/2014|          11/13/2014|
|   11/13/2014|          11/13/2014|
|   12-11-2014|          11/13/2014|
|   12-11-2014|          11/13/2014|
|   12-11-2014|          11/13/2014|
|   12-11-2014|          11-12-2014|
|   12-11-2014|          11-12-2014|
|   12-11-2014|          11-12-2014|
|   12-11-2014|          11-12-2014|
|   12-11-2014|          11-12-2014|
|   12-11-2014|          11-12-2014|
|   12-11-2014|          11-12-2014|
|   12-11-2014|          11-12-2014|
|   12-11-2014|          11-12-2014|
|   12-11-2014|          11-12-2014|
|   12-11-2014|          11-12-2014|
+-------------+--------------------+

To Date

sparkDF = sparkDF.withColumn('p1',F.to_date(F.col('Date received'),'MM/dd/yyyy'))\
                 .withColumn('p2',F.to_date(F.col('Date received'),'MM-dd-yyyy'))

sparkDF.show()

+-------------+--------------------+----------+----------+
|Date received|Date sent to company|        p1|        p2|
+-------------+--------------------+----------+----------+
|   11/13/2014|          11/13/2014|2014-11-13|      null|
|   11/13/2014|          11/13/2014|2014-11-13|      null|
|   11/13/2014|          11/13/2014|2014-11-13|      null|
|   11/13/2014|          11/13/2014|2014-11-13|      null|
|   12-11-2014|          11/13/2014|      null|2014-12-11|
|   12-11-2014|          11/13/2014|      null|2014-12-11|
|   12-11-2014|          11/13/2014|      null|2014-12-11|
|   12-11-2014|          11-12-2014|      null|2014-12-11|
|   12-11-2014|          11-12-2014|      null|2014-12-11|
|   12-11-2014|          11-12-2014|      null|2014-12-11|
|   12-11-2014|          11-12-2014|      null|2014-12-11|
|   12-11-2014|          11-12-2014|      null|2014-12-11|
|   12-11-2014|          11-12-2014|      null|2014-12-11|
|   12-11-2014|          11-12-2014|      null|2014-12-11|
|   12-11-2014|          11-12-2014|      null|2014-12-11|
|   12-11-2014|          11-12-2014|      null|2014-12-11|
|   12-11-2014|          11-12-2014|      null|2014-12-11|
|   12-11-2014|          11-12-2014|      null|2014-12-11|
+-------------+--------------------+----------+----------+

Coalesce

sparkDF = sparkDF.withColumn('date_received_parsed',F.coalesce(F.col('p1'),F.col('p2')))\
                 .drop(*['p1','p2'])

sparkDF.show()

+-------------+--------------------+--------------------+
|Date received|Date sent to company|date_received_parsed|
+-------------+--------------------+--------------------+
|   11/13/2014|          11/13/2014|          2014-11-13|
|   11/13/2014|          11/13/2014|          2014-11-13|
|   11/13/2014|          11/13/2014|          2014-11-13|
|   11/13/2014|          11/13/2014|          2014-11-13|
|   12-11-2014|          11/13/2014|          2014-12-11|
|   12-11-2014|          11/13/2014|          2014-12-11|
|   12-11-2014|          11/13/2014|          2014-12-11|
|   12-11-2014|          11-12-2014|          2014-12-11|
|   12-11-2014|          11-12-2014|          2014-12-11|
|   12-11-2014|          11-12-2014|          2014-12-11|
|   12-11-2014|          11-12-2014|          2014-12-11|
|   12-11-2014|          11-12-2014|          2014-12-11|
|   12-11-2014|          11-12-2014|          2014-12-11|
|   12-11-2014|          11-12-2014|          2014-12-11|
|   12-11-2014|          11-12-2014|          2014-12-11|
|   12-11-2014|          11-12-2014|          2014-12-11|
|   12-11-2014|          11-12-2014|          2014-12-11|
|   12-11-2014|          11-12-2014|          2014-12-11|
+-------------+--------------------+--------------------+
Vaebhav