I'm currently working with PySpark and I have a CSV file (it has several columns, of which I'll show only the date columns) that looks like this when opened in Excel:
Date received Date sent to company
11/13/2014 11/13/2014
11/13/2014 11/13/2014
11/13/2014 11/13/2014
11/13/2014 11/13/2014
12-11-2014 11/13/2014
12-11-2014 11/13/2014
12-11-2014 11/13/2014
12-11-2014 11-12-2014
12-11-2014 11-12-2014
12-11-2014 11-12-2014
12-11-2014 11-12-2014
12-11-2014 11-12-2014
12-11-2014 11-12-2014
12-11-2014 11-12-2014
12-11-2014 11-12-2014
12-11-2014 11-12-2014
12-11-2014 11-12-2014
12-11-2014 11-12-2014
Here is a screenshot for a clearer picture.
As you can see, I've loaded this CSV file in PySpark, but I want the date columns in one consistent format, say "dd-mm-yyyy".
Can somebody help me with this?
This is what I've tried so far:
from pyspark.sql.functions import col, to_date
df.select(col("Date_received"),to_date(col("Date_received"),"dd-MM-yyyy").alias("date")) \
.show()
Which gives the following output:
+-------------+----------+
|Date_received| date|
+-------------+----------+
| 11/13/2014| null|
| 11/13/2014| null|
| 11/13/2014| null|
| 11/13/2014| null|
| 12-11-2014|2014-11-12|
| 12-11-2014|2014-11-12|
| 12-11-2014|2014-11-12|
| 12-11-2014|2014-11-12|
| 12-11-2014|2014-11-12|
| 12-11-2014|2014-11-12|
| 12-11-2014|2014-11-12|
| 12-11-2014|2014-11-12|
| 12-11-2014|2014-11-12|
| 12-11-2014|2014-11-12|
| 12-11-2014|2014-11-12|
| 12-11-2014|2014-11-12|
| 12-11-2014|2014-11-12|
| 12-11-2014|2014-11-12|
| 12-11-2014|2014-11-12|
| 12-11-2014|2014-11-12|
+-------------+----------+
only showing top 20 rows
Notice that the first 4 rows come out as "null". Also, I passed "dd-MM-yyyy" as the format, so why is the output in "yyyy-MM-dd" format?
How can I tackle this problem? I want to change the date format here to "dd-mm-yyyy".