
I have a DataFrame in which column A is defined as string type, but the data in the column is actually a date. For example, column A holds "20170506" and I want to convert column A to date type. First, I tried the following:

df = df1.withColumn('A', df1['A'].cast(DateType()))

The result is that every item of column A was converted to null. I looked at the official Spark API documentation and realized the cast expects column A to be in 'yyyy-MM-dd' format, like "2017-05-06".
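The nulls come from a pattern mismatch: the cast only understands 'yyyy-MM-dd' strings and silently returns null for anything else. A stand-alone plain-Python sketch (not Spark code) of that behavior, using strptime as a stand-in for the cast:

```python
from datetime import datetime

def parse_or_none(s, fmt):
    """Return a date if s matches fmt, else None (mimicking Spark's null on a failed cast)."""
    try:
        return datetime.strptime(s, fmt).date()
    except ValueError:
        return None

# cast(DateType()) expects 'yyyy-MM-dd'; "20170506" does not match, hence null
print(parse_or_none('20170506', '%Y-%m-%d'))  # None
# the same string parses fine once the pattern matches the data
print(parse_or_none('20170506', '%Y%m%d'))    # 2017-05-06
```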

So, I tried another way:

# 1. use a udf ('%Y%m%d' matches the actual data, e.g. "20170506")
func = udf(lambda x: datetime.strptime(x, '%Y%m%d'), DateType())
df = df1.withColumn('A', func(col('A')))
# 2. use the to_date function
df = df1.select(to_date(from_unixtime(unix_timestamp(df1['A'], 'yyyyMMdd'))).alias('A'))
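Outside Spark, the udf's lambda is just strptime applied row by row. A plain-Python sketch of what approach 1 does per row, assuming the data really looks like "20170506":

```python
from datetime import datetime

# the same lambda the udf wraps; '%Y%m%d' matches strings like "20170506"
to_date_py = lambda x: datetime.strptime(x, '%Y%m%d').date()

rows = ['20170506', '20171231']
dates = [to_date_py(r) for r in rows]
print(dates)  # [datetime.date(2017, 5, 6), datetime.date(2017, 12, 31)]
```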

I don't think either way is a good solution, because both are overly complex and verbose.

Why can't to_date(col) take a format for the string column, like this: to_date(col, 'yyyy-MM-dd')? I think that would be very useful (by the way, Hive SQL can't specify a format either!).

And what is the best way to change a column type from String to Date in DataFrames? Thanks :)

1 Answer


You can change the data type from String to Date by using the to_date function:

dataframe.select(to_date($"dateCol", "dd-MMM-yyyy").alias("date"))

or you can use unix_timestamp

dataframe.select(to_date(unix_timestamp($"dateCol", "dd-MMM-yyyy").cast("timestamp")).alias("date"))
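The unix_timestamp route can be traced in plain Python (a sketch of the mechanics, not Spark code): parse the string to epoch seconds, reinterpret those seconds as a timestamp, then truncate to a date. UTC is assumed here for determinism; Spark would use the session time zone.

```python
from datetime import datetime, timezone

s = '06-May-2017'  # a string in the answer's 'dd-MMM-yyyy' shape
# unix_timestamp: string -> epoch seconds ('%d-%b-%Y' is the strptime analogue of 'dd-MMM-yyyy')
epoch = datetime.strptime(s, '%d-%b-%Y').replace(tzinfo=timezone.utc).timestamp()
# cast("timestamp") then to_date: epoch seconds -> timestamp -> date
d = datetime.fromtimestamp(epoch, tz=timezone.utc).date()
print(d)  # 2017-05-06
```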
koiralo