You were almost there!
The problem in your case was that the to_date function you were using did not know what format your date strings were in. There is an overload of to_date that accepts a fmt string telling Spark how to parse them.
That would look something like this:
import spark.implicits._
import org.apache.spark.sql.functions.{date_format, to_date}

// This is just the data you posted in a comment on your question
val df = spark.read.option("sep", ";").option("header", "true").csv("./dateTimeCSV.csv")
df.show
+---+----------+---------+---------+-----+
| Id| Date| Fname| Lname|Route|
+---+----------+---------+---------+-----+
| 1| 1/1/2017| Greg| Rollo| S|
| 2| 1/2/2017| Frankie| Allen| N|
| 2| 1/3/2017| Denise| Smith| E|
| 2| 2/11/2017| Earlene| Lane| W|
| 2| 3/22/2017| Donovan| Joseph| SE|
| 3| 3/11/2017| Rebecca| Bowls| NE|
| 3| 4/5/2017|Katherine| Cook| NW|
| 4| 5/16/2017| Alicia| Mason| SW|
| 4| 6/9/2017| Bob| Peterson| S|
| 4| 6/30/2017| Janet| Love| N|
| 5| 7/2/2017| Richard| Dingle| E|
| 5| 8/25/2017| Thomas|Velasquez| W|
| 5| 8/10/2017| Susan| King| SE|
| 1| 9/25/2017| Pratesh| Venkat| NE|
| 1|10/14/2017| Neha| Kumar| NW|
| 1| 11/2/2017| Louis| Williams| SW|
| 3|11/28/2017| Winston| McLean| E|
| 3| 12/2/2017| Clark| Kent| N|
| 3|12/14/2017| Bruce| Wayne| S|
+---+----------+---------+---------+-----+
// Parse the Date column using the MM/dd/yyyy pattern, then format out the two-digit month
val output = df.withColumn("Month", date_format(to_date(df("Date"), "MM/dd/yyyy"), "MM"))
output.show
+---+----------+---------+---------+-----+-----+
| Id| Date| Fname| Lname|Route|Month|
+---+----------+---------+---------+-----+-----+
| 1| 1/1/2017| Greg| Rollo| S| 01|
| 2| 1/2/2017| Frankie| Allen| N| 01|
| 2| 1/3/2017| Denise| Smith| E| 01|
| 2| 2/11/2017| Earlene| Lane| W| 02|
| 2| 3/22/2017| Donovan| Joseph| SE| 03|
| 3| 3/11/2017| Rebecca| Bowls| NE| 03|
| 3| 4/5/2017|Katherine| Cook| NW| 04|
| 4| 5/16/2017| Alicia| Mason| SW| 05|
| 4| 6/9/2017| Bob| Peterson| S| 06|
| 4| 6/30/2017| Janet| Love| N| 06|
| 5| 7/2/2017| Richard| Dingle| E| 07|
| 5| 8/25/2017| Thomas|Velasquez| W| 08|
| 5| 8/10/2017| Susan| King| SE| 08|
| 1| 9/25/2017| Pratesh| Venkat| NE| 09|
| 1|10/14/2017| Neha| Kumar| NW| 10|
| 1| 11/2/2017| Louis| Williams| SW| 11|
| 3|11/28/2017| Winston| McLean| E| 11|
| 3| 12/2/2017| Clark| Kent| N| 12|
| 3|12/14/2017| Bruce| Wayne| S| 12|
+---+----------+---------+---------+-----+-----+
As you can see, I added a fmt string with the value "MM/dd/yyyy" so that Spark knows how to parse the date.
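If you would rather have the month as a number instead of a zero-padded string, a small variation is to use the month function, which returns an integer column. This is just a sketch based on the df from the example above; outputAsInt is only an illustrative name:

import org.apache.spark.sql.functions.{month, to_date}

// month() gives the month as an Int (1-12) instead of the "01".."12" strings shown above
val outputAsInt = df.withColumn("Month", month(to_date(df("Date"), "MM/dd/yyyy")))
outputAsInt.show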
Hope this helps!