10

I'm trying to change my column type from string to date. I have consulted answers from:

  1. How to change the column type from String to Date in DataFrames?
  2. Why I get null results from date_format() PySpark function?

When I applied the answers from link 1, I got null results instead, so I turned to the answer from link 2, but I don't understand this part:

output_format = ...  # Some SimpleDateFormat string
Nimantha
Tata

3 Answers

14
from pyspark.sql.functions import col, unix_timestamp, to_date

#sample data
df = sc.parallelize([['12-21-2006'],
                     ['05-30-2007'],
                     ['01-01-1984'],
                     ['12-24-2017']]).toDF(["date_in_strFormat"])
df.printSchema()

df = df.withColumn('date_in_dateFormat', 
                   to_date(unix_timestamp(col('date_in_strFormat'), 'MM-dd-yyyy').cast("timestamp")))
df.show()
df.printSchema()

Output is:

root
 |-- date_in_strFormat: string (nullable = true)

+-----------------+------------------+
|date_in_strFormat|date_in_dateFormat|
+-----------------+------------------+
|       12-21-2006|        2006-12-21|
|       05-30-2007|        2007-05-30|
|       01-01-1984|        1984-01-01|
|       12-24-2017|        2017-12-24|
+-----------------+------------------+

root
 |-- date_in_strFormat: string (nullable = true)
 |-- date_in_dateFormat: date (nullable = true)
Zoe
Prem
  • Oh gosh, this helped but only partially :( some dates still returned null values. Like only some got converted? – Tata Dec 23 '17 at 15:18
  • You would need to check the date format in your string column. It should be in `MM-dd-yyyy` else it'll return `null`. – Prem Dec 23 '17 at 15:20
  • The original string for my date is written in dd/MM/yyyy. I used that in the code you have written, and like I said only some got converted into date type.... – Tata Dec 23 '17 at 15:25
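One possible explanation for the "only some got converted" symptom is a pattern whose day and month positions don't match the data. The sketch below is a plain-Python analogy, not Spark code: `parse_or_none` and the sample values are hypothetical, and `strptime` directives (`%d/%m/%Y`) are written differently from Spark's `dd/MM/yyyy` patterns. It only illustrates why a mismatched pattern can fail on some rows while appearing to work on others.

```python
from datetime import datetime


def parse_or_none(text, pattern):
    """Return a date if `text` matches `pattern`, else None.

    Hypothetical helper: it mimics a parser that yields a missing value
    instead of raising when the string does not fit the pattern.
    """
    try:
        return datetime.strptime(text, pattern).date()
    except ValueError:
        return None


# Hypothetical sample values, all written as day/month/year.
samples = ["05/03/2017", "21/12/2006", "30/05/2007"]

# Mismatched pattern (month first): only strings whose first field
# is a valid month (1-12) parse, and those parse to the wrong date.
wrong = [parse_or_none(s, "%m/%d/%Y") for s in samples]

# Matching pattern (day first): every string parses.
right = [parse_or_none(s, "%d/%m/%Y") for s in samples]

print(wrong)
print(right)
```

If the pattern does match and some rows still come back null, it may be worth checking those rows for stray whitespace or a different separator.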
5

A simple way:

from pyspark.sql.types import DateType

# A plain cast expects ISO-style strings such as '2017-12-24'
df_1 = df.withColumn("col_with_date_format",
                     df["col_with_date_format"].cast(DateType()))
KeepLearning
4

Here is an easier way, using the default to_date function:

from pyspark.sql import functions as F
df = df.withColumn('col_with_date_format', F.to_date(df.col_with_str_format))
Manish Singla